Abstract. This paper presents a distance function between sets based on an 
average of distances between their elements. The distance function is a metric 
if the sets are non-empty finite subsets of a metric space. It can be applied 
to produce various metric spaces on collections of sets and will be useful for 
analyzing complex data sets in the fields of computer science and information 
science. Its generalizations to include the Hausdorff metric and extensions to 
infinite sets for treating fuzzy sets are also discussed. 



1. Introduction 

A metric defined in general topology [1][2], based on a natural notion of distance
between points, can generally be extended to a distance between sets or more complex
elements. The Hausdorff metric is a typical example and is used in practice for
image data analysis [3], but it has some problems. In the Euclidean metric on
$\mathbb{R}$, for example, the Hausdorff distance between bounded subsets of $\mathbb{R}$ often depends
only on their suprema or infima, no matter how the other elements of the sets are
distributed within a certain range, which means that it emphasizes extremes
and disregards the middle parts of the sets. This is a drawback because it makes the
distance sensitive to noise, errors and outliers when analyzing real-world data. There is a need
to develop another metric that reflects the overall characteristics of the elements of the
sets.

In computer science, especially in the fields of pattern recognition, classification,
information retrieval and artificial intelligence, it is important for data analysis
to measure similarity or difference between data objects such as documents, images
and signals. If the data objects can be represented by vectors, a conventional
distance between vectors is a proper measure in their vector space. In practice,
however, there are various data objects that should be dealt with in the form of
collections of sets, probability distributions, graph-structured data, or collections
consisting of more complex data elements. To analyze these data objects, numerous
distance-like functions have been developed [4], such as the Mahalanobis distance
and the Kullback-Leibler divergence, even though they do not necessarily satisfy
symmetry and/or the triangle inequality.

Besides the Hausdorff metric, there is another type of true metric on sets,
such as the Jaccard distance, based on the cardinality of the
symmetric difference between sets or its variations. However, it measures only the
size of the set difference, and takes no account of qualitative differences between
individual elements. Thus, both metrics are insufficient for analyzing informative data
sets in which each element has its own specific meaning.
This paper presents a new distance function between sets based on an average 
distance. It takes all elements into account. It is a metric if the sets are non-empty 
finite subsets of a metric space, and includes the Jaccard distance as a special case. 
By using the power means [5], we obtain generalized forms that also include the 
Hausdorff metric. Extensions of the metric to hierarchical collections of infinite 
subsets will be useful for treating fuzzy sets and probability distributions. 

2. Preliminaries 

The metric is extended to various types of generalized metrics. To avoid confusion
in terminology, the following definition is used.

Definition 1. (Metric) Suppose $X$ is a set and $d$ is a function on $X \times X$ into
$\mathbb{R}$. Then $d$ is called a metric on $X$ if it satisfies the following conditions, for all
$a, b, c \in X$,

M1: $d(a,b) \ge 0$ (non-negativity),

M2: $d(a,a) = 0$,

M3: $d(a,b) = 0 \Rightarrow a = b$,

M4: $d(a,b) = d(b,a)$ (symmetry),

M5: $d(a,b) + d(b,c) \ge d(a,c)$ (triangle inequality).

The set $X$ is called a metric space and denoted by $(X,d)$. The function $d$ is
called a distance function or simply a distance.

The metric is generalized by relaxing the conditions as follows: 

• If d satisfies M1, M2, M4 and M5, then it is called a pseudo-metric.

• If d satisfies M1, M2, M3 and M5, then it is called a quasi-metric.

• If d satisfies M1, M2, M3 and M4, then it is called a semi-metric.

This terminology follows [1][2], though the term "semi-metric" is sometimes used
as a synonym of pseudo-metric [4].
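The conditions M1 to M5 can be tested numerically on a finite sample of points. The following is a minimal sketch, assuming real-valued distances and a small tolerance for floating-point comparison; the names `check_axioms` and `classify` are hypothetical helpers introduced here for illustration, and passing the check on a sample does not prove that a function is a metric on the whole space.

```python
from itertools import product

def check_axioms(points, d, tol=1e-12):
    """Test conditions M1 to M5 of Definition 1 on a finite list of points."""
    pts = list(points)
    pairs = list(product(pts, repeat=2))
    return {
        "M1": all(d(a, b) >= -tol for a, b in pairs),
        "M2": all(abs(d(a, a)) <= tol for a in pts),
        "M3": all(a == b or abs(d(a, b)) > tol for a, b in pairs),
        "M4": all(abs(d(a, b) - d(b, a)) <= tol for a, b in pairs),
        "M5": all(d(a, b) + d(b, c) >= d(a, c) - tol
                  for a, b, c in product(pts, repeat=3)),
    }

def classify(axioms):
    """Map the satisfied conditions to the terminology used in the text."""
    held = {name for name, ok in axioms.items() if ok}
    if held == {"M1", "M2", "M3", "M4", "M5"}:
        return "metric"
    if {"M1", "M2", "M4", "M5"} <= held:
        return "pseudo-metric"
    if {"M1", "M2", "M3", "M5"} <= held:
        return "quasi-metric"
    if {"M1", "M2", "M3", "M4"} <= held:
        return "semi-metric"
    return "none of the above"

points = [-1, 0, 1, 2]
print(classify(check_axioms(points, lambda x, y: abs(x - y))))          # metric
print(classify(check_axioms(points, lambda x, y: abs(x * x - y * y))))  # pseudo-metric
print(classify(check_axioms(points, lambda x, y: (x - y) ** 2)))        # semi-metric
```

The squared difference fails only M5 (for the points 0, 1, 2 we get 1 + 1 < 4), while the difference of squares fails only M3 because the distinct points -1 and 1 are at distance zero.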

A set-to-set distance is usually defined as follows (see, e.g., [6]): Let $A$ and $B$ be
two non-empty subsets of $X$. For each $x \in X$, the distance from $x$ to $A$, denoted
by $\operatorname{dist}(x, A)$, is defined by the equation

(2.1)  $\operatorname{dist}(x, A) = \inf\{d(x, a) \mid a \in A\}.$

This is fundamental not only to the definitions of a boundary point and an open 
set in metric spaces but also to the generalization of a metric space to approach 
space [7]. Similarly, the distance from A to B can be straightforwardly defined by 

(2.2)  $\operatorname{dist}(A, B) = \inf\{d(a, b) \mid a \in A,\ b \in B\}.$

The function dist() is neither a pseudo-metric nor a semi-metric. However, let
$\mathcal{S}(X)$ be the collection of all non-empty closed bounded subsets of $X$. Then, for
$A, B \in \mathcal{S}(X)$, the function $h(A, B)$ defined by

(2.3)  $h(A, B) = \max\{\sup\{\operatorname{dist}(b, A) \mid b \in B\},\ \sup\{\operatorname{dist}(a, B) \mid a \in A\}\}$

is a metric on $\mathcal{S}(X)$, and $h$ is called the Hausdorff metric. The collection $\mathcal{S}(X)$
topologized by the metric $h$ is called a hyperspace in general topology.
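For finite sets the infima and suprema in (2.1) and (2.3) become minima and maxima, so the Hausdorff distance can be computed directly. The sketch below assumes finite sets of numbers with the absolute difference as $d$; the function names are illustrative only. It also shows the sensitivity to a single outlying element mentioned in the Introduction.

```python
def dist_point_set(x, A, d):
    """dist(x, A) in (2.1): smallest distance from x to an element of A."""
    return min(d(x, a) for a in A)

def hausdorff(A, B, d):
    """Hausdorff distance (2.3) for non-empty finite sets A and B."""
    return max(
        max(dist_point_set(b, A, d) for b in B),
        max(dist_point_set(a, B, d) for a in A),
    )

d = lambda x, y: abs(x - y)
A = {0, 5, 10}
B = {0, 1, 2, 10}
C = {0, 10, 50}              # same as {0, 10} plus one outlier

print(hausdorff(A, B, d))    # 3
print(hausdorff(A, C, d))    # 40, determined entirely by the outlier 50
```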

In computer science, data sets are generally discrete and finite. A popular metric 
is the Jaccard distance (or Tanimoto distance, Marczewski-Steinhaus distance [1]) 
that is defined by 

(2.4)  $j(A,B) = \frac{|A \triangle B|}{|A \cup B|},$



METRIC BASED ON AVERAGE DISTANCE BETWEEN SETS 



where $|A|$ is the cardinality of $A$, and $\triangle$ denotes the symmetric difference: $A \triangle B =
(A \setminus B) \cup (B \setminus A)$. In addition, $|A \triangle B|$ is also used as a metric.

In cluster analysis [8], the distance (2.2) is used as the minimum distance between
data clusters for single-linkage clustering, and likewise the maximum distance is
defined by replacing infimum with maximum for complete-linkage clustering. Moreover,
the group-average distance (or average distance, mean distance) defined as
$g(A, B)$ in the following is also typically used for hierarchical clustering. Although
these three distance functions are not metrics, the group-average distance plays an
important role in this paper.

Lemma 2. Suppose $(X,d)$ is a non-empty metric space. Let $\mathcal{S}(X)$ denote the
collection of all non-empty finite subsets of $X$. For each $A$ and $B$ in $\mathcal{S}(X)$, define
$g(A, B)$ on $\mathcal{S}(X) \times \mathcal{S}(X)$ to be the function

(2.5)  $g(A,B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a,b).$

Then $g$ satisfies the triangle inequality.

Proof. The triangle inequality for $d$ yields $d(a, b) + d(b, c) - d(a, c) \ge 0$ for all
$a, b, c \in X$. Then, for all $A, B, C \in \mathcal{S}(X)$, we have

(2.6)  $g(A,B) + g(B,C) - g(A,C) = \frac{1}{|A|\,|B|\,|C|} \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} \bigl(d(a,b) + d(b,c) - d(a,c)\bigr) \ge 0.$

□
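Lemma 2 can be checked by brute force on a small example. The sketch below assumes points on the real line with $d(x,y) = |x - y|$; the helper names are hypothetical. It also illustrates why $g$ is not a metric: the self-distance $g(A, A)$ is positive whenever $A$ has more than one element.

```python
from itertools import combinations

def group_average(A, B, d):
    """Group-average distance g(A, B) in (2.5)."""
    return sum(d(a, b) for a in A for b in B) / (len(A) * len(B))

def nonempty_subsets(X):
    """All non-empty subsets of a finite set X, as frozensets."""
    items = sorted(X)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

d = lambda x, y: abs(x - y)
subsets = nonempty_subsets({0, 1, 3, 7})

# Smallest value of g(A,B) + g(B,C) - g(A,C) over all triples of subsets.
worst = min(group_average(A, B, d) + group_average(B, C, d) - group_average(A, C, d)
            for A in subsets for B in subsets for C in subsets)
print(worst >= -1e-12)                        # True: triangle inequality holds
print(group_average({0, 1}, {0, 1}, d))       # 0.5: non-zero self-distance
```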



For ease of notation, let $s(A, B)$ be the sum of all pairwise distances between $A$
and $B$ such that

(2.7)  $s(A,B) = \sum_{a \in A} \sum_{b \in B} d(a,b),$

so that $g(A, B) = (|A|\,|B|)^{-1} s(A, B)$. Since $d$ is a metric, we have $s(A,B) \ge 0$,
$s(A,B) = s(B,A)$, and $s(\{x\}, \{x\}) = 0$ for all $x \in X$. If $A = \emptyset$ or $B = \emptyset$, then
$s(A,B) = 0$ due to the empty sum. If $A$ and $B$ are countable unions of disjoint
sets, it can be decomposed as follows:

(2.8)  $s\Bigl(\bigcup_{i=1}^{n} A_i,\ \bigcup_{j=1}^{m} B_j\Bigr) = \sum_{i=1}^{n} \sum_{j=1}^{m} s(A_i, B_j),$

where $A_i \cap A_j = \emptyset = B_i \cap B_j$ for $i \ne j$. Furthermore, we define $t(A, B, C)$ by the
following equation

(2.9)  $t(A, B, C) = |C|\,s(A, B) + |A|\,s(B, C) - |B|\,s(A, C).$

It follows from Lemma 2 that $t(A,B,C) \ge 0$ for $A, B, C \in \mathcal{S}(X)$, which is a
shorthand notation of the triangle inequality (2.6).
3. Metric based on average distance 

Theorem 3. Suppose $(X, d)$ is a non-empty metric space. Let $\mathcal{S}(X)$ denote the
collection of all non-empty finite subsets of $X$. For each $A$ and $B$ in $\mathcal{S}(X)$, define
$f(A, B)$ on $\mathcal{S}(X) \times \mathcal{S}(X)$ to be the function

(3.1)  $f(A,B) = \frac{1}{|A \cup B|\,|A|} \sum_{a \in A} \sum_{b \in B \setminus A} d(a,b) + \frac{1}{|A \cup B|\,|B|} \sum_{a \in A \setminus B} \sum_{b \in B} d(a,b).$

Then $f$ is a metric on $\mathcal{S}(X)$.

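Proof. The function $f$ can be rewritten, using $s$ in (2.7), as

$f(A,B) = \frac{s(A, B \setminus A)}{|A \cup B|\,|A|} + \frac{s(A \setminus B, B)}{|A \cup B|\,|B|}.$

It is non-negative and symmetric. If $A = B$, then $s(A, B \setminus A) = s(A, \emptyset) = 0$ and
$s(A \setminus B, B) = s(\emptyset, B) = 0$, so that $f(A,B) = 0$. Conversely, if $f(A,B) = 0$, then
$s(A, B \setminus A) = 0 = s(A \setminus B, B)$ for $A, B \in \mathcal{S}(X)$. Since $d(a,b) > 0$ whenever
$a \in A$ and $b \notin A$, this holds if, and only if, $B \setminus A = \emptyset = A \setminus B$,
which implies $B \subseteq A$ and $A \subseteq B$. Then we have $f(A,B) = 0 \Leftrightarrow A = B$.

The triangle inequality $f(A,B) + f(B,C) - f(A,C) \ge 0$ is proved directly by
showing that the left-hand side can be transformed into a sum
of non-negative terms of $s$ and $t$ in (2.9). Let $A \cup B \cup C$ be decomposed into five
disjoint parts: $\alpha = A \setminus (B \cup C)$, $\beta = B \setminus (A \cup C)$, $\gamma = C \setminus (A \cup B)$, $\zeta = A \cap C \setminus B$,
and $\theta = B \setminus \beta = B \cap (A \cup C)$. Then we have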

\A\ \B\ \C\ \A UB\\BU C\ \A U C\ (f(A, B) + f(B, C) - f(A, C)) 

= \B\ \C\ (\6 U C\ t(A, B\A, 7 )+ \a\ t(A, f3, 7) + \B U C| t(A, [3, C \ A)) 

+ \A\ \B\ (\A U 6\ t{a, B\C,C) + | 7 | t(a, p, C) + \BU fl t(A \ C, 0, C)) 

+ \A\ \C\ (\A \ B\ t(a, B,C\B) + \B\ t(a, 6, 7) + \C \ B\ t(A \ B, B, 7)) 

+ \A\ \C\ \0\ (t(a, B, 7) + t(a, B, + t({, B, 7)) 

+ 2 L4| |C| (\9 U C\ \A U 6\ + |0| \A U C\)s(B, Q 

+ 2\A\ \B\ \C\ \BUC\s(f3,AnC) > 0. 

The details are given in Appendix A. □ 
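As a numerical companion to Theorem 3, the following sketch evaluates (3.1) and checks the metric conditions exhaustively over all non-empty subsets of a four-point subset of the real line. It assumes $d(x,y) = |x - y|$ and uses hypothetical helper names; it is a sanity check on one small example, not a substitute for the proof.

```python
from itertools import combinations

def avg_set_distance(A, B, d):
    """f(A, B) in (3.1) for non-empty finite sets A and B."""
    A, B = set(A), set(B)
    union = len(A | B)
    first = sum(d(a, b) for a in A for b in B - A) / (union * len(A))
    second = sum(d(a, b) for a in A - B for b in B) / (union * len(B))
    return first + second

def nonempty_subsets(X):
    """All non-empty subsets of a finite set X, as frozensets."""
    items = sorted(X)
    return [frozenset(c) for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

d = lambda x, y: abs(x - y)
f = lambda A, B: avg_set_distance(A, B, d)
subsets = nonempty_subsets({0, 1, 3, 7})

identity_ok = all((f(A, B) == 0) == (A == B) for A in subsets for B in subsets)
symmetry_ok = all(abs(f(A, B) - f(B, A)) < 1e-12 for A in subsets for B in subsets)
triangle_ok = all(f(A, B) + f(B, C) >= f(A, C) - 1e-12
                  for A in subsets for B in subsets for C in subsets)

print(f({0, 1}, {1, 3}))                        # 1.5
print(identity_ok, symmetry_ok, triangle_ok)    # True True True
```

For $A = \{0, 1\}$ and $B = \{1, 3\}$ the two terms of (3.1) are $5/6$ and $4/6$, which gives the value 1.5 printed above.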

The function $f$ in (3.1) can be rewritten, using $g$ in (2.5), as

(3.2)  $f(A, B) = \frac{|B \setminus A|}{|A \cup B|}\, g(A, B \setminus A) + \frac{|A \setminus B|}{|A \cup B|}\, g(A \setminus B, B).$

In $(\mathcal{S}(X), f)$, for all $a, b \in X$, we have $f(\{a\}, \{b\}) = d(a, b)$ so that $\{\{x\} \mid x \in X\}$
is an isometric copy of $X$. If $A \cap B = \emptyset$, then $f(A,B) = g(A,B)$. If $d$ is a
pseudo-metric, then so is $f$.

Example 4. If $d$ is the discrete metric, where $d(x, y) = 0$ if $x = y$ and $d(x, y) = 1$
otherwise, then $f(A, B)$ is equal to the Jaccard distance (2.4).
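Example 4 can be confirmed exactly with rational arithmetic. The sketch below assumes small integer-labelled sets and uses `fractions.Fraction` so that the comparison with the Jaccard distance involves no rounding; the function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

def avg_set_distance_exact(A, B, d):
    """f(A, B) in (3.1), returned as an exact fraction (d must be integer-valued)."""
    A, B = set(A), set(B)
    union = len(A | B)
    first = Fraction(sum(d(a, b) for a in A for b in B - A), union * len(A))
    second = Fraction(sum(d(a, b) for a in A - B for b in B), union * len(B))
    return first + second

def jaccard(A, B):
    """Jaccard distance (2.4)."""
    A, B = set(A), set(B)
    return Fraction(len(A ^ B), len(A | B))

discrete = lambda x, y: 0 if x == y else 1

items = [0, 1, 2, 3]
subsets = [frozenset(c) for r in range(1, 5) for c in combinations(items, r)]
same = all(avg_set_distance_exact(A, B, discrete) == jaccard(A, B)
           for A in subsets for B in subsets)
print(same)                                                   # True
print(avg_set_distance_exact({0, 1}, {1, 2, 3}, discrete))    # 3/4
```

The agreement follows because every pair counted in (3.1) consists of two distinct elements, so the two terms reduce to $|B \setminus A| / |A \cup B|$ and $|A \setminus B| / |A \cup B|$.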

Corollary 5. Suppose $(X,d)$ is a non-empty metric space. Let $\mathcal{S}(X)$ denote the
collection of all non-empty finite subsets of $X$. For each $A$ and $B$ in $\mathcal{S}(X)$, define
$e(A, B)$ on $\mathcal{S}(X) \times \mathcal{S}(X)$ to be the function

$e(A,B) = \frac{1}{|A|\,|B|} \Bigl( \sum_{a \in A} \sum_{b \in B} d(a,b) - \sum_{a \in A \cap B} \sum_{b \in A \cap B} d(a,b) \Bigr).$

Then $e$ is a semi-metric on $\mathcal{S}(X)$.

Proof. Let $e(A,B)$ be rewritten as

$e(A, B) = \frac{1}{|A|\,|B|} \bigl( s(A \setminus B, B \setminus A) + s(A \cap B, B \setminus A) + s(A \setminus B, A \cap B) \bigr).$

In a similar manner to the proof of Theorem 3, it can be proved that the conditions
M1 to M4 are satisfied; only M5 (triangle inequality) is not. □

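It is noted that the triangle inequality $e(A, B) + e(B, C) \ge e(A, C)$ holds if
$|\delta|\,|\varepsilon|\,|\eta| = 0$, where $\delta = A \cap B \setminus C$, $\varepsilon = B \cap C \setminus A$ and $\eta = A \cap B \cap C$. Otherwise,
for example, if $A = \delta \cup \eta$, $B = \delta \cup \eta \cup \varepsilon$ and $C = \eta \cup \varepsilon$ for non-empty $\delta$, $\varepsilon$ and $\eta$,
then we have

$|A|\,|B|\,|C|\,\bigl(e(A, B) + e(B, C) - e(A, C)\bigr) = |\eta|\,s(\delta, \varepsilon) - |\varepsilon|\,s(\delta, \eta) - |\delta|\,s(\eta, \varepsilon) = -t(\delta, \eta, \varepsilon),$

which is negative whenever $t(\delta, \eta, \varepsilon) > 0$, so that the condition M5 is not generally satisfied.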
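The failure of the triangle inequality for $e$ can be reproduced with a concrete instance of the construction above. The sketch assumes points on the real line with $d(x,y) = |x - y|$ and the hypothetical choice $\delta = \{0\}$, $\eta = \{5\}$, $\varepsilon = \{1\}$, for which $t(\delta, \eta, \varepsilon) = 5 + 4 - 1 = 8$.

```python
from fractions import Fraction

def pair_sum(A, B, d):
    """s(A, B) in (2.7): sum of all pairwise distances."""
    return sum(d(a, b) for a in A for b in B)

def e_distance(A, B, d):
    """e(A, B) of Corollary 5 as an exact fraction (d must be integer-valued)."""
    A, B = set(A), set(B)
    common = A & B
    total = pair_sum(A, B, d) - pair_sum(common, common, d)
    return Fraction(total, len(A) * len(B))

d = lambda x, y: abs(x - y)
delta, eta, eps = {0}, {5}, {1}
A, B, C = delta | eta, delta | eta | eps, eta | eps

gap = e_distance(A, B, d) + e_distance(B, C, d) - e_distance(A, C, d)
print(gap)                                # -2/3
print(len(A) * len(B) * len(C) * gap)     # -8, i.e. -t(delta, eta, eps)
```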

4. Extensions 

This section discusses future directions for generalization of the average distance 
based on the power mean and extensions to metrics on collections of infinite sets. 

4.1. Generalization based on the power mean. The distance function (3.1)
can be unified with the Hausdorff metric for finite sets by using the power mean.
To simplify expressions, we use the following notation. Let $M_p^{(i)}(x \in A, \psi, w)$ be
an extended weighted power mean of $\psi(x)$ such that

(4.1)  $M_p^{(1)}(x \in A, \psi, w) = \Bigl( \frac{1}{\sum_{x \in A} w(x)} \sum_{x \in A} w(x)\,(\psi(x))^p \Bigr)^{1/p},$

and its variation using the exponential transform of $\psi(x)$,

(4.2)  $M_p^{(0)}(x \in A, \psi, w) = \frac{1}{p} \ln \Bigl( \frac{1}{\sum_{x \in A} w(x)} \sum_{x \in A} w(x) \exp(p\,\psi(x)) \Bigr),$

where $i \in \{0, 1\}$ indicates one of the two types (4.1) and (4.2), $p$ is an extended
real number, $\psi$ is a non-negative function of $x \in A$, and $w$ is a weight such that
$w(x) \in (0, 1]$ for each $x$ and $\sum_{x \in A} w(x) > 0$. In addition, let $M_p^{(i)}(x \in A, \psi)$ denote
the abbreviation of the equal weight case $M_p^{(i)}(x \in A, \psi, 1_A(x))$, where $1_A(x)$ is the
indicator function defined by $1_A(x) = 1$ for $x \in A$ and $1_A(x) = 0$ for $x \notin A$. If there
exists $x \in A$ such that $\psi(x) = 0$ for $p < 0$ in (4.1), then we define $M_p^{(1)} = 0$, which
is consistent with taking the limit $\psi(x) \to +0$, though such a case is undefined in
the conventional power mean to avoid division by zero.

The power mean includes various types of means [5], which are parameterized
by $p$. By taking limits also for $p = 0, \pm\infty$, we have the following:

$M_1^{(1)}(x \in A, \psi(x), 1_A(x)) = M_0^{(0)}(x \in A, \psi(x)) = \frac{1}{|A|} \sum_{x \in A} \psi(x),$

$M_{\infty}^{(i)}(x \in A, \psi(x), w(x)) = M_{\infty}^{(i)}(x \in A, \psi(x)) = \max\{\psi(x) \mid x \in A\},$

$M_{-\infty}^{(i)}(x \in A, \psi(x), w(x)) = M_{-\infty}^{(i)}(x \in A, \psi(x)) = \min\{\psi(x) \mid x \in A\}.$
Both $M_1^{(1)}$ and $M_0^{(0)}$ give the same arithmetic mean, whereas $M_0^{(1)}$ gives the
geometric mean, and neither $M_{\infty}^{(i)}$ nor $M_{-\infty}^{(i)}$ depends on $w(x)$.
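The two types of mean in (4.1) and (4.2), together with their limiting cases, can be sketched as a single function. This is a minimal illustration under stated assumptions: values are non-negative floats, weights are given as a parallel list, the argument `kind` (a name introduced here) plays the role of the index $i$, and zero-weight terms are simply skipped.

```python
import math

def power_mean(values, weights, p, kind=1):
    """Weighted mean M_p^{(kind)}: kind=1 is the power type (4.1),
    kind=0 is the exponential type (4.2)."""
    pairs = [(v, w) for v, w in zip(values, weights) if w > 0]
    total = sum(w for _, w in pairs)
    if p == math.inf:                      # limit p -> +infinity
        return max(v for v, _ in pairs)
    if p == -math.inf:                     # limit p -> -infinity
        return min(v for v, _ in pairs)
    if kind == 0:
        if p == 0:                         # limit p -> 0: arithmetic mean
            return sum(w * v for v, w in pairs) / total
        return math.log(sum(w * math.exp(p * v) for v, w in pairs) / total) / p
    if p <= 0 and any(v == 0 for v, _ in pairs):
        return 0.0                         # convention stated for (4.1)
    if p == 0:                             # limit p -> 0: geometric mean
        return math.exp(sum(w * math.log(v) for v, w in pairs) / total)
    return (sum(w * v ** p for v, w in pairs) / total) ** (1 / p)

vals, wts = [1.0, 2.0, 4.0], [1, 1, 1]
print(power_mean(vals, wts, 1, kind=1))          # arithmetic mean, 7/3
print(power_mean(vals, wts, 0, kind=0))          # arithmetic mean again
print(power_mean(vals, wts, 0, kind=1))          # geometric mean, about 2
print(power_mean(vals, wts, math.inf))           # 4.0
print(power_mean(vals, wts, -math.inf))          # 1.0
```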

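There are some forms of function composition of $M_p^{(i)}$ that include the distance
function (3.1) as a special case, for example, as follows:

(4.3)  $u_{p,q}^{(i,j)}(A,B) = M_p^{(i)}\Bigl( x \in A \cup B,\ 1_{B \setminus A}(x)\, M_q^{(j)}(y \in A, d(x,y)) + 1_{A \setminus B}(x)\, M_q^{(j)}(y \in B, d(x,y)) \Bigr),$

(4.4)  $v_{r,p,q}^{(i,j,k)}(A,B) = M_r^{(i)}\Bigl( S \in \{A, B\},\ M_p^{(j)}\bigl( x \in A \cup B,\ 1_{S^c}(x)\, M_q^{(k)}(y \in S, d(x, y)) \bigr) \Bigr),$

where $i, j, k \in \{0, 1\}$ and $S^c$ is the complement of $S$.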

In addition, let $w$ be extended to $w \in [0, 1]$ to include zero, though a weight
for at least one summand must still be positive. Furthermore, it is assumed that
$0 \cdot \infty = 0$, in order to ensure $0 \cdot 0^p = 0$ for $p < 0$ in $M_p^{(1)}$, so that the zero-weight can
be used for excluding terms from averaging even if $\psi(x) = 0$ (i.e., distance zero) in
the terms. Then, the function (4.3) can be simply expressed as

(4.5)  $u_{p,q}^{(i,j)}(A, B) = M_p^{(i)}\bigl( x \in A \cup B,\ M_q^{(j)}(y \in A \cup B, d(x, y), w(x, y)) \bigr)$

by using the weight function defined by

$w(x, y) = [x \in A \setminus B][y \in B] + [x \in A \cap B][x = y] + [x \in B \setminus A][y \in A]$
$\phantom{w(x, y)} = 1_{A \setminus B}(x)\, 1_B(y) + 1_{A \cap B}(x)\,[x = y] + 1_{B \setminus A}(x)\, 1_A(y),$

where $[\cdot]$ denotes the Iverson bracket, that is, a quantity defined to be 1 whenever
the statement within the brackets is true, and 0 otherwise.

The distance function $f$ in (3.1) and the Hausdorff metric (2.3) are expressed by

$f(A,B) = u_{1,1}^{(1,1)}(A,B) = 2\,v_{1,1,1}^{(1,1,1)}(A,B),$
$h(A,B) = u_{\infty,-\infty}^{(i,j)}(A, B) = v_{\infty,\infty,-\infty}^{(i,j,k)}(A, B),$

respectively, where $i, j, k \in \{0, 1\}$.
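The composition (4.3) can be sketched for the power type of mean with equal weights, which is enough to recover both $f$ (with $p = q = 1$) and the Hausdorff distance (with $p = \infty$, $q = -\infty$) on finite sets. The code below is an illustration under those assumptions; `mean_p` and `u` are names chosen here, and the direct formulas are included only for comparison.

```python
import math

def mean_p(values, p):
    """Equal-weight power mean of type (4.1), with the limits p = +inf and -inf."""
    values = list(values)
    if p == math.inf:
        return max(values)
    if p == -math.inf:
        return min(values)
    if p <= 0 and any(v == 0 for v in values):
        return 0.0
    if p == 0:
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1 / p)

def u(A, B, d, p, q):
    """u_{p,q}^{(1,1)}(A, B) in (4.3), using the power type for both means."""
    A, B = set(A), set(B)
    def psi(x):
        if x in B - A:
            return mean_p((d(x, y) for y in A), q)
        if x in A - B:
            return mean_p((d(x, y) for y in B), q)
        return 0.0                          # x lies in both sets
    return mean_p((psi(x) for x in A | B), p)

def f_direct(A, B, d):
    """f(A, B) in (3.1), for comparison."""
    A, B = set(A), set(B)
    n = len(A | B)
    return (sum(d(a, b) for a in A for b in B - A) / (n * len(A))
            + sum(d(a, b) for a in A - B for b in B) / (n * len(B)))

def hausdorff(A, B, d):
    """Hausdorff distance (2.3) on finite sets, for comparison."""
    return max(max(min(d(b, a) for a in A) for b in B),
               max(min(d(a, b) for b in B) for a in A))

d = lambda x, y: abs(x - y)
A, B = {0, 5, 10}, {0, 1, 2, 10}
print(u(A, B, d, 1, 1), f_direct(A, B, d))                    # both about 2.65
print(u(A, B, d, math.inf, -math.inf), hausdorff(A, B, d))    # both 3
```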

Although it is unclear, at present, what conditions on the parameters $i, j, k, p,
q$, and $r$ are necessary for (4.3) and (4.4) to be metrics, these generalized forms are
in fact capable of generating various distance functions, as follows:

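Example 6. The exponential types $u_{p,q}^{(0,0)}(A, B)$ and $v_{r,p,q}^{(0,0,0)}(A, B)$ are written as

$u_{p,q}^{(0,0)}(A,B) = \frac{1}{p} \ln \Bigl( \frac{1}{|A \cup B|} \sum_{x \in A \cup B} \Bigl( 1_{B \setminus A}(x) \Bigl( \frac{1}{|A|} \sum_{y \in A} e^{q\,d(x,y)} \Bigr)^{p/q} + 1_{A \cap B}(x) + 1_{A \setminus B}(x) \Bigl( \frac{1}{|B|} \sum_{y \in B} e^{q\,d(x,y)} \Bigr)^{p/q} \Bigr) \Bigr),$

$v_{r,p,q}^{(0,0,0)}(A,B) = \frac{1}{r} \ln \Bigl( \frac{1}{2} \Bigl( \frac{1}{|A \cup B|} \sum_{x \in B \setminus A} \Bigl( \frac{1}{|A|} \sum_{y \in A} e^{q\,d(x,y)} \Bigr)^{p/q} + \frac{|A|}{|A \cup B|} \Bigr)^{r/p} + \frac{1}{2} \Bigl( \frac{1}{|A \cup B|} \sum_{x \in A \setminus B} \Bigl( \frac{1}{|B|} \sum_{y \in B} e^{q\,d(x,y)} \Bigr)^{p/q} + \frac{|B|}{|A \cup B|} \Bigr)^{r/p} \Bigr).$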

If $d$ is the discrete metric multiplied by a positive constant $\lambda$, then we have

The functions (4.6) and (4.7) are metrics for $p > 0$ and $p < 0$, respectively. The
proofs of each triangle inequality are outlined in Appendix B. If $p = 0$, then it is
the same situation as Example 4. By taking the limit for $p \to 0$, we can see that
both functions are equal to the Jaccard distance (2.4) except for the coefficient $\lambda$.

4.2. Hierarchical metric spaces. Suppose that $(X, d)$ is a metric space and $\mathcal{S}(X)$
is the collection of all non-empty finite subsets of $X$. Let $k$ be a non-negative
integer, let $\mathcal{S}_{k+1}(X)$ denote the collection of all non-empty finite subsets of $\mathcal{S}_k(X)$,
and let $f_k$ be a metric on $\mathcal{S}_k(X)$, where $\mathcal{S}_1(X)$, $\mathcal{S}_0(X)$, $f_1$ and $f_0$ correspond to,
respectively, $\mathcal{S}(X)$, $X$, $f$, and $d$ in Theorem 3. For $k \ge 1$, in much the same way,
for each $A$ and $B$ in $\mathcal{S}_k(X)$, the function $f_k(A, B)$ can be defined by

(4.8)  $f_k(A,B) = \frac{1}{|A \cup B|\,|A|} \sum_{a \in A} \sum_{b \in B \setminus A} f_{k-1}(a,b) + \frac{1}{|A \cup B|\,|B|} \sum_{a \in A \setminus B} \sum_{b \in B} f_{k-1}(a,b),$

which generates a metric space $(\mathcal{S}_k(X), f_k)$ based on $(\mathcal{S}_{k-1}(X), f_{k-1})$. This metric
will be useful for constructing hierarchical hyperspaces.
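The recursion (4.8) amounts to applying the same averaging construction with the previous level's distance in place of $d$. A minimal sketch follows, assuming sets are represented as `frozenset` objects so that they can themselves be elements of sets; `lift` is a hypothetical helper name.

```python
def avg_set_distance(A, B, d):
    """One level of (4.8): the average-distance metric built on top of d."""
    A, B = set(A), set(B)
    union = len(A | B)
    first = sum(d(a, b) for a in A for b in B - A) / (union * len(A))
    second = sum(d(a, b) for a in A - B for b in B) / (union * len(B))
    return first + second

def lift(d):
    """Turn a distance on elements into the distance f on finite sets of elements."""
    return lambda A, B: avg_set_distance(A, B, d)

f0 = lambda x, y: abs(x - y)     # d on X, here the real line
f1 = lift(f0)                    # metric on S_1(X)
f2 = lift(f1)                    # metric on S_2(X)

a = frozenset({frozenset({0}), frozenset({0, 1})})
b = frozenset({frozenset({0, 1}), frozenset({3})})
print(f2(a, b))

# Singletons of singletons reproduce the base distance.
print(f2(frozenset({frozenset({0})}), frozenset({frozenset({3})})))   # 3.0
```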



4.3. Duality. There is a kind of duality between sets and elements with respect 
to their distance functions. For example, we can define the functions D and d 
symmetrically as follows: 

(4.9)  $D(A, B) = |\{a \mid a \in A\} \triangle \{b \mid b \in B\}| = |A \triangle B|,$

(4.10)  $d(a, b) = |\{A \mid a \in A\} \triangle \{B \mid b \in B\}| = |C(a) \triangle C(b)|,$

where $C(a) = \{A \mid a \in A\}$. The set-to-set distance $D$ in (4.9) is a metric due to
the axiom of extensionality, and the element-to-element distance $d$ in (4.10) is a
pseudo-metric.

According to Theorem 3, $D$ can be defined by $f$ in (3.1), instead of (4.9), so that
we have

$D(A,B) = f(A,B),$

d(a,6) = |C(o)AC(6)| - / (f|C(o),f|C(6)) • 

In this case, $D$ depends on $d$ so that $D$ is also a pseudo-metric. Furthermore, in
the situation of Section 4.2, let $(X, d)$ be a metric space, let $\mathcal{S}_1(X)$ be the collection
of all non-empty finite subsets of $X$, and let $C(a) = \{A \in \mathcal{S}_1(X) \mid a \in A\}$ be an
element of $\mathcal{S}_2(X)$. If $d$ is the discrete metric, then we have

$D(A,B) = f_1(A,B) = j(A,B),$
$f_2(C(a), C(b)) = \kappa\, d(a,b),$

where $\kappa$ is a certain positive real number such that $\kappa < 1$. If there exists $d$ such
that $d(a,b) = f_2(C(a), C(b))$, then the isometric copy of $(X,d)$ is contained in
$(\mathcal{S}_2(X), f_2)$ and $d$ can be regarded as the function of $D$. In general, there possibly
exist $D$ and $d$ that are formally expressed by

(4.11)  $D(A,B) = F(\{d(a,b) \mid a \in A,\ b \in B\}),$

(4.12)  $d(a, b) = G(\{D(A, B) \mid a \in A,\ b \in B\}),$

where $F$ and $G$ may be generalized functions such as the one given in (4.5). It is interesting
to consider whether $F$ and $G$ really exist and what features they have. In numerical
analysis, $D$ and $d$ that are consistent with each other will be obtained by the iterative
computation of (4.11) for all $A, B \in \mathcal{S}_1(X)$ and (4.12) for all $a, b \in X$, starting
with an initial metric space $(X, d)$, if each converges to a non-trivial function.

4.4. Generalized metrics. The group average distance $g(A, B)$ in (2.5) can be
regarded as a generalized metric that satisfies conditions M1 (non-negativity), M4
(symmetry) and M5 (triangle inequality) in Definition 1. In conventional topology,
there has been no such generalization by dropping both conditions M2 and M3,
which are usually combined together into the single axiom $d(a, b) = 0 \Leftrightarrow a = b$
(identity, reflexivity, or coincidence). Although M3 can be dropped for the
pseudo-metric so as to allow $d(a, b) = 0$ for $a \ne b$, the self-distance $d(a, a) = 0$ (M2) seems
to be indispensable in point-set topology where the element is a point having no
size. An exception is a partial metric [9] that is defined to satisfy M1, M3, M4 and,
instead of M5, the following partial metric triangularity,

(4.13)  $d(a, b) + d(b, c) \ge d(a, c) + d(b, b).$

In computer science or information science, the element of data sets is not merely
a simple point. It may have rich contents inside. Some elements may have internal
structures which cause non-zero self-distance, and some elements may have properties
that differ from each other, even though they are indiscernible from a metric point of
view. The concept of distance can be used for measuring not only the difference
between objects but also the cost of moving or the energy of transition between
states. This is the reason why the generalization toward non-zero self-distance is
worth considering. If the triangle inequality holds, it provides an upper and lower
bound for them. The group average distance $g(A, B)$ is a typical example,
and it is simpler and more natural than the metric $f(A,B)$ in (3.1).

Incidentally, the function $g(A, B)$ is not a partial metric because it does not
satisfy (4.13). On the other hand, $f(A,B)$ gives an approach to an instance of
partial metrics from a special case of (4.7) in Example 6. By taking the limit as
$\lambda \to \infty$ for $p < 0$ and multiplying by a positive constant, for non-empty finite sets $A$
and $B$, we have the following metric,

$D_\nu(A,B) = \log|A \cup B| - \nu \log(|A|\,|B|),$

where $\nu = 1/2$. Its triangle inequality is equivalent to $|A \cup B|\,|B \cup C| \ge |A \cup C|\,|B|$.
This suggests that, for $\nu \in [0, 1/2)$, $D_\nu$ is a partial metric on a collection of non-empty
finite sets.
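The two inequalities mentioned for $D_\nu$ can be checked exhaustively on a small universe. The sketch below assumes subsets of a four-element set and a floating-point tolerance; it tests the ordinary triangle inequality for $\nu = 1/2$ and the partial metric triangularity (4.13) for the sample value $\nu = 1/4$, which is an arbitrary choice from $[0, 1/2)$.

```python
import math
from itertools import combinations

def d_nu(A, B, nu):
    """D_nu(A, B) = log|A u B| - nu * log(|A| |B|) for non-empty finite sets."""
    return math.log(len(A | B)) - nu * math.log(len(A) * len(B))

items = [0, 1, 2, 3]
subsets = [frozenset(c) for r in range(1, 5) for c in combinations(items, r)]
triples = [(A, B, C) for A in subsets for B in subsets for C in subsets]

triangle_half = all(d_nu(A, B, 0.5) + d_nu(B, C, 0.5) >= d_nu(A, C, 0.5) - 1e-12
                    for A, B, C in triples)
partial_quarter = all(
    d_nu(A, B, 0.25) + d_nu(B, C, 0.25) >= d_nu(A, C, 0.25) + d_nu(B, B, 0.25) - 1e-12
    for A, B, C in triples)

print(triangle_half, partial_quarter)         # True True
print(d_nu(frozenset({0, 1}), frozenset({0, 1}), 0.25))   # non-zero self-distance
```

Both checks reduce to $|A \cup B|\,|B \cup C| \ge |A \cup C|\,|B|$, which follows from $(|A \cup B| - |B|)(|B \cup C| - |B|) \ge 0$ together with $|A \cup C| \le |A \cup B| + |B \cup C| - |B|$.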

4.5. Extension to infinite sets. If $\mathcal{S}(X)$ is the collection of all non-null measurable
subsets of $(X, d)$, and $d$ is Lebesgue integrable on each element of $\mathcal{S}(X)$, then
the group average distance $g(A, B)$, for $A, B \in \mathcal{S}(X)$, can be defined by

(4.14)  $g(A,B) = \frac{1}{\mu(A)\,\mu(B)} \int_A \Bigl( \int_B d(x,y)\, d\mu(y) \Bigr) d\mu(x),$

where $x \in A$, $y \in B$, and $\mu$ is a measure on $X$, and then the distance function (3.2)
can be extended to

(4.15)  $f(A, B) = \frac{\mu(B \setminus A)}{\mu(A \cup B)}\, g(A, B \setminus A) + \frac{\mu(A \setminus B)}{\mu(A \cup B)}\, g(A \setminus B, B).$

If $d$ is the discrete metric, then (4.15) is equal to the Steinhaus distance [1].

Example 7. Let $(\mathbb{R}, d)$ be a metric space and let $d(x, y) = |x - y|$. For two intervals
$A$ and $B$, the distance function (4.15) can be expressed as

$f(A,B) = \frac{|\sup(A) - \sup(B)| + |\inf(A) - \inf(B)|}{2} - [(A \subset B) \vee (A \supset B)]\, \frac{|\sup(A) - \sup(B)|\,|\inf(A) - \inf(B)|}{\sup(A \cup B) - \inf(A \cup B)}.$

If $A \not\subset B$ and $A \not\supset B$ (i.e., $[(A \subset B) \vee (A \supset B)] = 0$), then $f(A, B)$ is equal to
the distance between the centers of $A$ and $B$. This is consistent with an intuitive
notion of the distance between balls in this $(\mathbb{R}, d)$.
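The closed form for intervals can be compared with the finite-set formula (3.1) applied to a fine grid of points. The sketch below assumes intervals given as (inf, sup) pairs and approximates $[0, 4]$ and $[1, 2]$ by integer grids scaled by a factor of 100; the function names and the grid resolution are choices made here for illustration.

```python
def interval_distance(A, B):
    """Closed-form f(A, B) for intervals A = (inf, sup) and B = (inf, sup)."""
    (a_lo, a_hi), (b_lo, b_hi) = A, B
    d_sup, d_inf = abs(a_hi - b_hi), abs(a_lo - b_lo)
    nested = (b_lo <= a_lo and a_hi <= b_hi) or (a_lo <= b_lo and b_hi <= a_hi)
    value = (d_sup + d_inf) / 2
    span = max(a_hi, b_hi) - min(a_lo, b_lo)
    if nested and span > 0:
        value -= d_sup * d_inf / span     # correction term for nested intervals
    return value

def avg_set_distance(A, B, d):
    """f(A, B) in (3.1) for finite sets, used here as a discrete approximation."""
    A, B = set(A), set(B)
    union = len(A | B)
    first = sum(d(a, b) for a in A for b in B - A) / (union * len(A))
    second = sum(d(a, b) for a in A - B for b in B) / (union * len(B))
    return first + second

print(interval_distance((0, 2), (1, 4)))    # 1.5, the distance between centers
print(interval_distance((0, 4), (1, 2)))    # 1.0, nested case

# Grid approximation of [0, 4] and [1, 2] with step 0.01 (integers scaled by 100).
grid_A, grid_B = set(range(0, 401)), set(range(100, 201))
approx = avg_set_distance(grid_A, grid_B, lambda x, y: abs(x - y)) / 100
print(approx)                               # close to 1.0
```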

If $\mathcal{S}(X)$ is the collection of all non-empty, countably infinite subsets (measure-zero
sets) of $X$, then $g(A,B)$ and $f(A,B)$ should be defined by taking limits in
(2.5) and (3.2), provided that both have definite values. In order to determine the
average distance, we have to define a proper condition, which should be said to be
"averageable". The average distance will strongly depend on accumulation points in
$A$ and $B$, and it will require additional assumptions on the difference of the strength
between the accumulation points. This requirement is closely related to a "relative
measure" that is needed to obtain the ratio of the cardinality of an infinite set to
the cardinality of its superset in (3.2). In conventional measure theory, however,
any set of cardinality $\aleph_0$ is a null set having measure zero so that both counting
measure and Lebesgue measure are useless for computing the ratio. It is necessary
to use another measure. A feasible solution is discussed in the following section.

4.6. Estimation by sampling. In application to computational data analysis, 
statistical estimation by sampling is very useful for obtaining the approximate value 
of f(A, B) when the size of the sets is very large. According to the law of large 
numbers, if enough sample elements are selected randomly, an average generated 
by those samples should approximate the average of the total population. The 
procedure is as follows: 

(1) Choose a superset $P$ of $A \cup B$ as a population, i.e., $P \supseteq A \cup B$.

(2) Select a finite subset $S$ of $P$ as a sample obtained by random sampling.

(3) Let $S_A = S \cap A$ and $S_B = S \cap B$. Then, compute $f(S_A, S_B)$ as an
approximation of $f(A, B)$.
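The three steps above can be sketched as follows. The code assumes a finite population of integers with $d(x,y) = |x - y|$, uniform sampling without replacement via the standard `random` module, and a hypothetical function name `estimate_by_sampling`; a sample that misses $A$ or $B$ entirely is rejected, since $f$ is defined only for non-empty sets.

```python
import random

def avg_set_distance(A, B, d):
    """f(A, B) in (3.1) for non-empty finite sets."""
    A, B = set(A), set(B)
    union = len(A | B)
    first = sum(d(a, b) for a in A for b in B - A) / (union * len(A))
    second = sum(d(a, b) for a in A - B for b in B) / (union * len(B))
    return first + second

def estimate_by_sampling(A, B, population, sample_size, d, rng):
    """Steps (1) to (3): sample S from P, then evaluate f on S & A and S & B."""
    sample = set(rng.sample(sorted(population), sample_size))
    S_A, S_B = sample & set(A), sample & set(B)
    if not S_A or not S_B:
        raise ValueError("sample misses A or B; draw a larger sample")
    return avg_set_distance(S_A, S_B, d)

d = lambda x, y: abs(x - y)
A = set(range(0, 600))
B = set(range(400, 1000))
P = set(range(0, 1000))          # population P containing A and B

exact = avg_set_distance(A, B, d)
approx = estimate_by_sampling(A, B, P, 200, d, random.Random(0))
print(exact, approx)
```

With a sample of 200 out of 1000 elements the estimate needs roughly 25 times fewer distance evaluations than the exact value, at the price of sampling error.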
The sampling process and its randomness are crucial for efficiently estimating a good
approximation. Some useful hints could be found in various sampling techniques
developed for Monte Carlo methods [10]. In most cases, sampling error is expected
to decrease as the sample size increases, except for situations where the distribution
of $d$ has no mean (e.g., Cauchy distribution).

The notion of sampling suggests an intuitive way to define a relative measure
on a $\sigma$-algebra over a set $X$, which could be called "sample counting measure".
Suppose $A$ and $B$ are subsets of $X$. Let $\rho(A : B)$ be the ratio of the cardinality of
$A$ to the cardinality of $B$, let $P$ be a superset of $A \cup B$, and let $S_n$ be a non-empty
finite subset of $P$ such that $S_n = \bigcup_{i=1}^{n} Y_i$ where $Y_i$ is the $i$-th non-empty sample
randomly selected from $P$. Then, the ratio $\rho(A : B)$ can be determined by taking
the limit as $n$ approaches infinity as follows:

$\rho(A : B) = \lim_{n \to \infty} \frac{|S_n \cap A|}{|S_n \cap B|},$

if there exist such a limit and a random choice function that performs random
sampling. Otherwise, instead of random sampling, systematic sampling could be
available if the elements of $X$ are supposed to be distributed with uniform density
in its measurable metric space. For example, suppose there exists a finite partition
of $P$ where every part has an almost equal diameter. It seems better for $S_n$ to have
exactly one element from each of the parts.

4.7. Metrics for fuzzy sets and probability distributions. A fuzzy set can
be represented by a collection of crisp sets so that the distance between fuzzy sets
can be defined by the distance between the collections of such crisp sets. Let $A$ be
a fuzzy set: $A = \{(x, m_A(x)) \mid x \in X\}$, where $m_A(x)$ is a membership function,
and let $A_\alpha$ be a crisp set called an $\alpha$-level set [11] such that $A_\alpha = \{x \in X \mid
m_A(x) \ge \alpha\}$. Then, $A$ can be represented by the following set of ordered pairs:
$C(A) = \{(A_\alpha, \alpha) \mid \alpha \in (0, 1]\}$. The distance between two fuzzy sets $A$ and $B$ can
be defined by $f_2(C(A), C(B))$, where there may be various ways to treat $\alpha$. This
notion is also applicable to the distance between probability distributions, where
probability density functions are used instead of the membership function.

5. Concluding Remarks 

We have found that, for a metric space $(X,d)$, there exists a distance function
between non-empty finite subsets of $X$ that is a metric based on the average distance
of $d$. The distance function (3.1) in Theorem 3 is the most typical one, which
includes the Jaccard distance as a special case where $d$ is the discrete metric. Its
extensions based on the power mean will be useful to develop generalized forms that
also include the Hausdorff metric and various other distance functions. Furthermore,
the extensions to infinite subsets of $X$ will provide metrics for measuring
dissimilarity of fuzzy sets and probability distributions.

Appendix A. Triangle Inequality in Theorem 3

The triangle inequality for

$f(A, B) = (|A \cup B|\,|A|)^{-1} s(A, B \setminus A) + (|A \cup B|\,|B|)^{-1} s(A \setminus B, B)$

can be proved by showing the following inequality:

$|A|\,|B|\,|C|\,|A \cup B|\,|B \cup C|\,|A \cup C|\,\bigl(f(A, B) + f(B, C) - f(A, C)\bigr) \ge 0.$
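Let $A \cup B \cup C$ be decomposed into the following seven disjoint sets:

$\alpha = A \setminus (B \cup C)$, $\beta = B \setminus (A \cup C)$, $\gamma = C \setminus (A \cup B)$,

$\delta = A \cap B \setminus C$, $\varepsilon = B \cap C \setminus A$, $\zeta = C \cap A \setminus B$, $\eta = A \cap B \cap C$,

and let $\theta = B \setminus \beta = B \cap (A \cup C)$, so that we have

$A = \alpha \cup \delta \cup \zeta \cup \eta$, $|A| = |\alpha| + |\delta| + |\zeta| + |\eta|$,

$B = \beta \cup \delta \cup \varepsilon \cup \eta$, $|B| = |\beta| + |\delta| + |\varepsilon| + |\eta|$,

$C = \gamma \cup \varepsilon \cup \zeta \cup \eta$, $|C| = |\gamma| + |\varepsilon| + |\zeta| + |\eta|$,

$\theta = \delta \cup \varepsilon \cup \eta$, $|\theta| = |\delta| + |\varepsilon| + |\eta|$.

Taking account of (2.8) and (2.9), we have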
\A\ \B\ \C\ \A U B| |B U C\ \A U C| (/(A B) + /(B, C) - /(A, C)) 
= \B\ \C\ \B U C\ \A U C| s(A, B \ A) + | A| |C| \B U C| |A U C\ s(A \ B, B) 
+ \A\ \C\ \A U B\ \A U C| s(B, C \ B) + | A| |B| | A U B\ \ A U C| s(B \ C, C) 

- \B\ \C\ \A U B\ \B U C| s(A, C \ A) - \A\ \B\ \A U B\ \B U C| s(A \ C, C) 

= \B\ \C\ (| 7 | |S U C\ s(A, B\A) + (| 7 | \a\ + |B U C| \C\ A\)s(A, p) 

+ \BU(\ \A\ (s{A \C,/3)+ s(A n C, P)) + (\(3\ | 7 | + |B U C| |A U e\)s(A, e)) 
+ \A\ \C\ ((| 7 | \A U C| + |C| | 7 | + ICI |A U e\)s(a, B) 

+ |B| | A U e\ (s(a, B \ C) + s(a, B n C)) + |B| | 7 | (s(a, /3) + s(a, 0)) 

+ (|* U C\ | 7 | + \S U C| |A U e\ + \p\ \A U C\)s((, B)) 
+ |A| |C| ((|a| |A U C| + |C| |a| + ICI |<5 U C\)s(B, 7 ) 

+ |B| \6 U C| (s(B \ A, 7 ) + s(A n B, 7 )) + |B| |a| (s(/3, 7 ) + S (0, 7 )) 

+ (| A U e| |a| + |A U e\ \S U C| + |/3| | A U C|)s(B, C)) 
+ |A| |B| (|a| |A U e\ s(B \C,C) + (\a\ | 7 | + |CU B| |A\ C\)s((3, C) 

+ |C U B| |C| (s(/3, A n C) + s(/3, C\ A)) + (|/?| |a| + |A U B| |<5 U C|)s(<5, C)) 

- |B| |C| (|C U B| |/3| S (A, C \ A) + (|a| |/3| + |B \ A| \S U C|)s(A, 7 ) 

+ |A| |5UC| (s(aUC, 7 ) + s(AnB, 7 )) + (|/3| | 7 | + |AUe| |B U C\)s(A, e)) 
- \A\ \B\ (\(3\ \B U CI s(A \C,C) + (\(3\ | 7 | + |A U e\ \B \ C\)s(a, C) 

+ \A U e| \C\ (s(a, C U 7 ) + s(a, B n C)) + (|a| |/3| + |A U B| \6 U C\)s(S, C)) 

= \B\ \C\ (\S U C\ t(A, B \ A, 7) + |a| t(A, /?, 7 ) + |B U CI t(A, 0, C \ A)) 
+ |A| |B| (|A U e| t(a, B \ C, C) + | 7 | t(a, 0, C) + |B U CI t(A \ C, /3, C)) 
+ |A| |C| ((|AUC| + |ClMa,B,7) + \B\t(a,6, 7 )) 
+ \A\ \C\ (|A U e| t(a, B, () + \C U 5| t(C, B, 7 )) 
+ 2 |A| |C| (|* U C\ \A U e| + |/3| |A U C|)s(B, C) 
+ 2 |A| |B| |C| |B U CI s(f3, A n C) > 0. 

where the equality holds if all terms of s and t are zero. 
Appendix B. Triangle Inequalities of Example 6

The triangle inequality for (4.6) can be proved as follows: Let $x = e^{p\lambda}$ and let

$\tau(x) = \Bigl( \frac{x\,|A \triangle B| + |A \cap B|}{|A \cup B|} \Bigr) \Bigl( \frac{x\,|B \triangle C| + |B \cap C|}{|B \cup C|} \Bigr) - \Bigl( \frac{x\,|A \triangle C| + |A \cap C|}{|A \cup C|} \Bigr).$

Then the triangle inequality for $p > 0$ is equivalent to $\tau(x) \ge 0$ for $x \ge 1$. The first
derivative of $\tau(x)$ with respect to $x$ is

$\tau'(x) = 2(x - 1)\,j(A, B)\,j(B, C) + j(A, B) + j(B, C) - j(A, C),$

where $j$ is the Jaccard distance (2.4). Since $\tau(1) = 0$ and $\tau'(x) \ge 0$ for $x \ge 1$, we
have $\tau(e^{p\lambda}) \ge 0$ for $p > 0$.
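The triangle inequality for (4.7) can be proved as follows: Let $y = 1 - e^{p\lambda}$ and
let

$\tau(y) = \Bigl(1 - \frac{|A \setminus C|}{|A \cup C|}\,y\Bigr) \Bigl(1 - \frac{|C \setminus A|}{|A \cup C|}\,y\Bigr) - \Bigl(1 - \frac{|A \setminus B|}{|A \cup B|}\,y\Bigr) \Bigl(1 - \frac{|B \setminus A|}{|A \cup B|}\,y\Bigr) \Bigl(1 - \frac{|B \setminus C|}{|B \cup C|}\,y\Bigr) \Bigl(1 - \frac{|C \setminus B|}{|B \cup C|}\,y\Bigr) = y\,\varphi(y),$

where $\varphi(y)$ is the cubic function of $y$ with a negative leading coefficient. Then the
triangle inequality for $p < 0$ is equivalent to $\tau(y) \ge 0$ for $y \in [0, 1)$. The function
$\varphi(y)$ satisfies the following inequalities: $\varphi(0) \ge 0$, $\varphi(1) = \tau(1) \ge 0$, $\varphi'(1) \le 0$, and
$\varphi''(1) \ge 0$. These inequalities can be proved by decomposition of $A$, $B$, and $C$ into
$\alpha$, $\beta$, $\gamma$, $\delta$, $\varepsilon$, $\zeta$ and $\eta$ defined in Appendix A. Then, we have $\varphi(y) \ge 0$ for $y \in [0, 1)$,
therefore, $\tau(1 - e^{p\lambda}) \ge 0$ for $p < 0$.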
OSAMU FUJITA 



Key words and phrases. Metric, distance between sets, average distance, power mean, Hausdorff metric.

This paper presents a new distance function between sets based on an average
distance. It takes all elements into account. It is a metric if the sets are non-empty
finite subsets of a metric space, and includes the Jaccard distance as a special case.
By using the power means [5], we obtain generalized forms that also include the
Hausdorff metric. Extensions of the metric to hierarchical collections of infinite
subsets will be useful for treating fuzzy sets and probability distributions.

2. Preliminaries 

The metric is extended to various types of generalized metrics. To avoid confusion
in terminology, the following definition is used.

Definition 1. (Metric) Suppose $X$ is a set and $d$ is a function on $X \times X$ into
$\mathbb{R}$. Then $d$ is called a metric on $X$ if it satisfies the following conditions, for all
$a, b, c \in X$,

M1: $d(a, b) \ge 0$ (non-negativity),

M2: $d(a, a) = 0$,

M3: $d(a, b) = 0 \Rightarrow a = b$,

M4: $d(a, b) = d(b, a)$ (symmetry),

M5: $d(a, b) + d(b, c) \ge d(a, c)$ (triangle inequality).

The set $X$ is called a metric space and denoted by $(X, d)$. The function $d$ is
called a distance function or simply a distance.

The metric is generalized by relaxing the conditions as follows:

• If $d$ satisfies M1, M2, M4 and M5, then it is called a pseudo-metric.

• If $d$ satisfies M1, M2, M3 and M5, then it is called a quasi-metric.

• If $d$ satisfies M1, M2, M3 and M4, then it is called a semi-metric.

This terminology follows [1][2], though the term "semi-metric" is sometimes used
as a synonym of pseudo-metric [4].

A set-to-set distance is usually defined as follows (see, e.g., [6]): Let $A$ and $B$ be
two non-empty subsets of $X$. For each $x \in X$, the distance from $x$ to $A$, denoted
by $\mathrm{dist}(x, A)$, is defined by the equation

(2.1) $\mathrm{dist}(x, A) = \inf\{d(x, a) \mid a \in A\}.$

This is fundamental not only to the definitions of a boundary point and an open
set in metric spaces but also to the generalization of a metric space to an approach
space [7]. Similarly, the distance from $A$ to $B$ can be straightforwardly defined by

(2.2) $\mathrm{dist}(A, B) = \inf\{d(a, b) \mid a \in A,\ b \in B\}.$

The function $\mathrm{dist}(\cdot,\cdot)$ is neither a pseudo-metric nor a semi-metric. However, let
$S(X)$ be the collection of all non-empty closed bounded subsets of $X$. Then, for
$A, B \in S(X)$, the function $h(A, B)$ defined by

(2.3) $h(A, B) = \max\{\sup\{\mathrm{dist}(b, A) \mid b \in B\},\ \sup\{\mathrm{dist}(a, B) \mid a \in A\}\}$

is a metric on $S(X)$, and $h$ is called the Hausdorff metric. The collection $S(X)$
topologized by the metric $h$ is called a hyperspace in general topology.
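
For finite sets the infima and suprema in (2.1) to (2.3) reduce to minima and maxima, so the three definitions can be evaluated directly. The following Python sketch is only an illustration under that assumption; the function names `dist_point_set`, `dist_set_set` and `hausdorff` and the sample sets are hypothetical and do not come from the paper.

```python
def dist_point_set(x, A, d):
    """dist(x, A) of (2.1) for a finite non-empty set A."""
    return min(d(x, a) for a in A)


def dist_set_set(A, B, d):
    """dist(A, B) of (2.2) for finite non-empty sets A and B."""
    return min(d(a, b) for a in A for b in B)


def hausdorff(A, B, d):
    """Hausdorff distance h(A, B) of (2.3) for finite non-empty sets."""
    return max(max(dist_point_set(b, A, d) for b in B),
               max(dist_point_set(a, B, d) for a in A))


d = lambda x, y: abs(x - y)      # usual metric on the real line
A = {0.0, 1.0}
B = {0.0, 1.0, 10.0}

print(dist_set_set(A, B, d))     # 0.0, although A != B
print(hausdorff(A, B, d))        # 9.0, determined by the single point 10.0
```

The second value shows the sensitivity to extremes mentioned in the introduction: one outlying element fixes the Hausdorff distance regardless of the remaining elements.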

In computer science, data sets are generally discrete and finite. A popular metric
is the Jaccard distance (or Tanimoto distance, Marczewski-Steinhaus distance [1])
that is defined by

(2.4) $j(A, B) = \dfrac{|A \triangle B|}{|A \cup B|},$

where $|A|$ is the cardinality of $A$, and $\triangle$ denotes the symmetric difference:
$A \triangle B = (A \setminus B) \cup (B \setminus A)$. In addition, $|A \triangle B|$ is also used as a metric.

In cluster analysis [8], the distance (2.2) is used as the minimum distance between
data clusters for single-linkage clustering, and likewise the maximum distance is
defined by replacing infimum with maximum for complete-linkage clustering.
Moreover, the group-average distance (or average distance, mean distance) defined as
$g(A, B)$ in the following is also typically used for hierarchical clustering. Although
these three distance functions are not metrics, the group-average distance plays an
important role in this paper.

Lemma 2. Suppose $(X, d)$ is a non-empty metric space. Let $S(X)$ denote the
collection of all non-empty finite subsets of $X$. For each $A$ and $B$ in $S(X)$, define
$g(A, B)$ on $S(X) \times S(X)$ to be the function

(2.5) $g(A, B) = \dfrac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b).$

Then $g$ satisfies the triangle inequality.

Proof. The triangle inequality for $d$ yields $d(a, b) + d(b, c) - d(a, c) \ge 0$ for all
$a, b, c \in X$. Then, for all $A, B, C \in S(X)$, we have

(2.6) $g(A, B) + g(B, C) - g(A, C) = \dfrac{1}{|A|\,|B|\,|C|} \sum_{a \in A} \sum_{b \in B} \sum_{c \in C} \bigl(d(a, b) + d(b, c) - d(a, c)\bigr) \ge 0.$

□
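
Lemma 2 can be checked by brute force on a small example. The sketch below is an illustration only: the point set, the helper `nonempty_subsets` and the use of exact fractions are assumptions made here. It also shows why $g$ is not a metric: the self-distance $g(A, A)$ is generally positive.

```python
from fractions import Fraction
from itertools import combinations, product


def g(A, B, d):
    """Group-average distance (2.5) between finite non-empty sets."""
    return Fraction(sum(d(a, b) for a in A for b in B), len(A) * len(B))


def nonempty_subsets(points):
    """All non-empty subsets of a finite list of points."""
    return [frozenset(c) for r in range(1, len(points) + 1)
            for c in combinations(points, r)]


d = lambda x, y: abs(x - y)          # integer points on the real line
subsets = nonempty_subsets([0, 1, 3, 7])

triangle_ok = all(g(A, B, d) + g(B, C, d) >= g(A, C, d)
                  for A, B, C in product(subsets, repeat=3))
print(triangle_ok)                   # True
print(g({0, 3}, {0, 3}, d))          # 3/2: the self-distance is not zero
```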



For ease of notation, let $s(A, B)$ be the sum of all pairwise distances between $A$
and $B$ such that

(2.7) $s(A, B) = \sum_{a \in A} \sum_{b \in B} d(a, b),$

so that $g(A, B) = (|A|\,|B|)^{-1} s(A, B)$. Since $d$ is a metric, we have $s(A, B) \ge 0$,
$s(A, B) = s(B, A)$, and $s(\{x\}, \{x\}) = 0$ for all $x \in X$. If $A = \emptyset$ or $B = \emptyset$, then
$s(A, B) = 0$ due to the empty sum. If $A$ and $B$ are countable unions of disjoint
sets, it can be decomposed as follows:

(2.8) $s\Bigl(\bigcup_{i=1}^{n} A_i,\ \bigcup_{j=1}^{m} B_j\Bigr) = \sum_{i=1}^{n} \sum_{j=1}^{m} s(A_i, B_j),$

where $A_i \cap A_j = \emptyset = B_i \cap B_j$ for $i \ne j$. Furthermore, we define $t(A, B, C)$ by the
following equation

(2.9) $t(A, B, C) = |C|\, s(A, B) + |A|\, s(B, C) - |B|\, s(A, C).$

It follows from Lemma 2 that $t(A, B, C) \ge 0$ for $A, B, C \in S(X)$, which is a
shorthand notation of the triangle inequality (2.6).

3. Metric based on average distance

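Theorem 3. Suppose $(X, d)$ is a non-empty metric space. Let $S(X)$ denote the
collection of all non-empty finite subsets of $X$. For each $A$ and $B$ in $S(X)$, define
$f(A, B)$ on $S(X) \times S(X)$ to be the function

(3.1) $f(A, B) = \dfrac{1}{|A \cup B|\,|A|} \sum_{a \in A} \sum_{b \in B \setminus A} d(a, b) + \dfrac{1}{|A \cup B|\,|B|} \sum_{a \in A \setminus B} \sum_{b \in B} d(a, b).$

Then $f$ is a metric on $S(X)$.

Proof. The function $f$ can be rewritten, using $s$ in (2.7), as

$f(A, B) = \dfrac{s(A, B \setminus A)}{|A \cup B|\,|A|} + \dfrac{s(A \setminus B, B)}{|A \cup B|\,|B|}.$

It is non-negative and symmetric. If $A = B$, then $s(A, B \setminus A) = s(A, \emptyset) = 0$ and
$s(A \setminus B, B) = s(\emptyset, B) = 0$, so that $f(A, B) = 0$. Conversely, if $f(A, B) = 0$, then
$s(A, B \setminus A) = 0 = s(A \setminus B, B)$ for $A, B \in S(X)$. This holds if, and only if, $B \setminus A =
\emptyset = A \setminus B$, which implies $B \subseteq A$ and $A \subseteq B$. Then we have $f(A, B) = 0 \Leftrightarrow A = B$.

The triangle inequality $f(A, B) + f(B, C) - f(A, C) \ge 0$ is proved by showing that
the left-hand side can be transformed into a sum
of non-negative terms of $s$ and $t$ in (2.9). Let $A \cup B \cup C$ be decomposed into five
disjoint parts: $\alpha = A \setminus (B \cup C)$, $\beta = B \setminus (A \cup C)$, $\gamma = C \setminus (A \cup B)$, $\zeta = (A \cap C) \setminus B$,
and $\theta = B \setminus \beta = B \cap (A \cup C)$. Then we have

$$\begin{aligned}
&|A|\,|B|\,|C|\,|A \cup B|\,|B \cup C|\,|A \cup C|\,\bigl(f(A, B) + f(B, C) - f(A, C)\bigr) \\
&\quad= |B|\,|C|\,\bigl(|\theta \cup C|\, t(A, B \setminus A, \gamma) + |\alpha|\, t(A, \beta, \gamma) + |B \cup \zeta|\, t(A, \beta, C \setminus A)\bigr) \\
&\qquad+ |A|\,|B|\,\bigl(|A \cup \theta|\, t(\alpha, B \setminus C, C) + |\gamma|\, t(\alpha, \beta, C) + |B \cup \zeta|\, t(A \setminus C, \beta, C)\bigr) \\
&\qquad+ |A|\,|C|\,\bigl(|A \setminus B|\, t(\alpha, B, C \setminus B) + |B|\, t(\alpha, \theta, \gamma) + |C \setminus B|\, t(A \setminus B, B, \gamma)\bigr) \\
&\qquad+ |A|\,|C|\,|\theta|\,\bigl(t(\alpha, B, \gamma) + t(\alpha, B, \zeta) + t(\zeta, B, \gamma)\bigr) \\
&\qquad+ 2\,|A|\,|C|\,\bigl(|\theta \cup C|\,|A \cup \theta| + |\beta|\,|A \cup C|\bigr)\, s(B, \zeta) \\
&\qquad+ 2\,|A|\,|B|\,|C|\,|B \cup \zeta|\, s(\beta, A \cap C) \ \ge\ 0.
\end{aligned}$$

The details are given in Appendix A. □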
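
As a numerical complement to the proof, the metric conditions for $f$ can be verified exhaustively on a small finite metric space. The following sketch is an illustration under assumed inputs (five integer points on the real line, exact rational arithmetic); the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations, product


def f(A, B, d):
    """Average-distance function (3.1) for finite non-empty sets A and B."""
    A, B = frozenset(A), frozenset(B)
    n_union = len(A | B)
    s_a = sum(d(a, b) for a in A for b in B - A)      # s(A, B \ A)
    s_b = sum(d(a, b) for a in A - B for b in B)      # s(A \ B, B)
    return Fraction(s_a, n_union * len(A)) + Fraction(s_b, n_union * len(B))


def nonempty_subsets(points):
    """All non-empty subsets of a finite list of points."""
    return [frozenset(c) for r in range(1, len(points) + 1)
            for c in combinations(points, r)]


d = lambda x, y: abs(x - y)          # integer-valued metric, so sums stay exact
subsets = nonempty_subsets([0, 1, 3, 7, 12])
table = {(A, B): f(A, B, d) for A, B in product(subsets, repeat=2)}

identity_ok = all((table[A, B] == 0) == (A == B) for A, B in table)
symmetry_ok = all(table[A, B] == table[B, A] for A, B in table)
triangle_ok = all(table[A, B] + table[B, C] >= table[A, C]
                  for A, B, C in product(subsets, repeat=3))

print(identity_ok, symmetry_ok, triangle_ok)   # True True True
print(f({0, 1}, {1, 3}, d))                    # 3/2
```

Such a check covers only the chosen points and is no substitute for the proof; it is mainly useful for catching transcription errors in an implementation of (3.1).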

The function $f$ in (3.1) can be rewritten, using $g$ in (2.5), as

(3.2) $f(A, B) = \dfrac{|B \setminus A|}{|A \cup B|}\, g(A, B \setminus A) + \dfrac{|A \setminus B|}{|A \cup B|}\, g(A \setminus B, B).$

In $(S(X), f)$, for all $a, b \in X$, we have $f(\{a\}, \{b\}) = d(a, b)$ so that $\{\{x\} \mid x \in X\}$
is an isometric copy of $X$. If $A \cap B = \emptyset$, then $f(A, B) = g(A, B)$. If $d$ is a
pseudo-metric, then so is $f$.

Example 4. If $d$ is the discrete metric, where $d(x, y) = 0$ if $x = y$ and $d(x, y) = 1$
otherwise, then $f(A, B)$ is equal to the Jaccard distance (2.4).
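
Example 4 follows because, under the discrete metric, $s(A, B \setminus A) = |A|\,|B \setminus A|$ and $s(A \setminus B, B) = |A \setminus B|\,|B|$, so that (3.1) collapses to $|A \triangle B| / |A \cup B|$. The sketch below confirms this on all pairs of non-empty subsets of a four-element set; the item labels and helper names are assumptions for illustration.

```python
from fractions import Fraction
from itertools import combinations


def f(A, B, d):
    """Average-distance function (3.1) for finite non-empty sets A and B."""
    A, B = frozenset(A), frozenset(B)
    n_union = len(A | B)
    s_a = sum(d(a, b) for a in A for b in B - A)
    s_b = sum(d(a, b) for a in A - B for b in B)
    return Fraction(s_a, n_union * len(A)) + Fraction(s_b, n_union * len(B))


def jaccard(A, B):
    """Jaccard distance (2.4)."""
    A, B = frozenset(A), frozenset(B)
    return Fraction(len(A ^ B), len(A | B))


discrete = lambda x, y: 0 if x == y else 1
items = ["a", "b", "c", "d"]
subsets = [frozenset(c) for r in range(1, len(items) + 1)
           for c in combinations(items, r)]

same = all(f(A, B, discrete) == jaccard(A, B) for A in subsets for B in subsets)
print(same)   # True
```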

Corollary 5. Suppose $(X, d)$ is a non-empty metric space. Let $S(X)$ denote the
collection of all non-empty finite subsets of $X$. For each $A$ and $B$ in $S(X)$, define
$e(A, B)$ on $S(X) \times S(X)$ to be the function

(3.3) $e(A, B) = \dfrac{1}{|A|\,|B|} \Bigl( \sum_{a \in A} \sum_{b \in B} d(a, b) - \sum_{a \in A \cap B} \sum_{b \in A \cap B} d(a, b) \Bigr).$

Then $e$ is a semi-metric on $S(X)$.

Proof. Let $e(A, B)$ be rewritten as

$e(A, B) = \dfrac{1}{|A|\,|B|} \bigl( s(A \setminus B, B \setminus A) + s(A \cap B, B \setminus A) + s(A \setminus B, A \cap B) \bigr).$

In a similar manner to the proof of Theorem 3, it can be proved that the conditions
from M1 to M4, except for M5 (triangle inequality), are satisfied. □

It is noted that the triangle inequality $e(A, B) + e(B, C) \ge e(A, C)$ holds if
$|\delta|\,|\varepsilon|\,|\eta| = 0$, where $\delta = (A \cap B) \setminus C$, $\varepsilon = (B \cap C) \setminus A$ and $\eta = A \cap B \cap C$. Otherwise,
for example, if $A = \delta \cup \eta$, $B = \delta \cup \eta \cup \varepsilon$ and $C = \eta \cup \varepsilon$ for non-empty $\delta$, $\varepsilon$ and $\eta$,
then we have

$|A|\,|B|\,|C|\,\bigl(e(A, B) + e(B, C) - e(A, C)\bigr) = |\eta|\, s(\delta, \varepsilon) - |\varepsilon|\, s(\delta, \eta) - |\delta|\, s(\eta, \varepsilon) = -t(\delta, \eta, \varepsilon) \le 0,$

which is strictly negative whenever $t(\delta, \eta, \varepsilon) > 0$, so that the condition M5 is not generally satisfied.
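
A concrete instance of this failure can be computed directly. In the sketch below the three parts are the hypothetical singletons $\delta = \{0\}$, $\eta = \{5\}$ and $\varepsilon = \{1\}$ on the real line, chosen here (not in the paper) so that $t(\delta, \eta, \varepsilon) > 0$.

```python
from fractions import Fraction


def s(A, B, d):
    """Sum of pairwise distances, as in (2.7)."""
    return sum(d(a, b) for a in A for b in B)


def t(A, B, C, d):
    """Shorthand (2.9) for the triangle inequality of the group average."""
    return len(C) * s(A, B, d) + len(A) * s(B, C, d) - len(B) * s(A, C, d)


def e(A, B, d):
    """Function e of Corollary 5: pairs inside the intersection are removed."""
    A, B = frozenset(A), frozenset(B)
    common = A & B
    return Fraction(s(A, B, d) - s(common, common, d), len(A) * len(B))


d = lambda x, y: abs(x - y)
delta, eta, eps = {0}, {5}, {1}
A, B, C = delta | eta, delta | eta | eps, eta | eps

lhs = e(A, B, d) + e(B, C, d)
rhs = e(A, C, d)
print(lhs, rhs)                                          # 11/6 5/2
print(len(A) * len(B) * len(C) * (lhs - rhs))            # -8
print(-t(delta, eta, eps, d))                            # -8
```

Here $e(A, B) + e(B, C) = 11/6 < 5/2 = e(A, C)$, and the scaled defect equals $-t(\delta, \eta, \varepsilon)$ as stated above.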

4. Extensions 

This section discusses future directions for generalization of the average distance
based on the power mean and extensions to metrics on collections of infinite sets.

4.1. Generalization based on the power mean. The distance function (3.1)
can be unified with the Hausdorff metric for finite sets by using the power mean.
To simplify expressions, we use the following notation. Let $M_p^{(i)}(x \in A, \psi, w)$ be
an extended weighted power mean of $\psi(x)$ such that

(4.1) $M_p^{(1)}(x \in A, \psi, w) = \Bigl( \dfrac{1}{\sum_{x \in A} w(x)} \sum_{x \in A} w(x)\, (\psi(x))^p \Bigr)^{1/p},$

and its variation using the exponential transform of $\psi(x)$,

(4.2) $M_p^{(0)}(x \in A, \psi, w) = \dfrac{1}{p} \ln \Bigl( \dfrac{1}{\sum_{x \in A} w(x)} \sum_{x \in A} w(x) \exp(p\, \psi(x)) \Bigr),$

where $i \in \{0, 1\}$ indicates one of the two types (4.1) and (4.2), $p$ is an extended
real number, $\psi$ is a non-negative function of $x \in A$, and $w$ is a weight such that
$w(x) \in (0, 1]$ for each $x$ and $\sum_{x \in A} w(x) > 0$. In addition, let $M_p^{(i)}(x \in A, \psi)$ denote
the abbreviation of the equal weight case $M_p^{(i)}(x \in A, \psi, 1_A(x))$, where $1_A(x)$ is the
indicator function defined by $1_A(x) = 1$ for $x \in A$ and $1_A(x) = 0$ for $x \notin A$. If there
exists $x \in A$ such that $\psi(x) = 0$ for $p < 0$ in (4.1), then we define $M_p^{(1)} = 0$, which
is consistent with taking the limit $\psi(x) \to +0$, though such a case is undefined in
the conventional power mean to avoid division by zero.

The power mean includes various types of means [5], which are parameterized
by $p$. By taking limits also for $p = 0, \pm\infty$, we have the following:

$M_1^{(1)}(x \in A, \psi(x), 1_A(x)) = M_0^{(0)}(x \in A, \psi(x)) = \dfrac{1}{|A|} \sum_{x \in A} \psi(x),$

$M_{\infty}^{(i)}(x \in A, \psi(x), w(x)) = M_{\infty}^{(i)}(x \in A, \psi(x)) = \max\{\psi(x) \mid x \in A\},$

$M_{-\infty}^{(i)}(x \in A, \psi(x), w(x)) = M_{-\infty}^{(i)}(x \in A, \psi(x)) = \min\{\psi(x) \mid x \in A\}.$

Both $M_1^{(1)}$ and $M_0^{(0)}$ give the same arithmetic mean, whereas $M_0^{(1)}$ gives the
geometric mean, and neither $M_{\infty}^{(i)}$ nor $M_{-\infty}^{(i)}$ depends on $w(x)$.

There are some forms of function composition of $M_p^{(i)}$ that include the distance
function (3.1) as a special case, for example, as follows:

(4.3) $u_{p,q}^{(i,j)}(A, B) = M_p^{(i)}\Bigl(x \in A \cup B,\ 1_{B \setminus A}(x)\, M_q^{(j)}(y \in A, d(x, y)) + 1_{A \setminus B}(x)\, M_q^{(j)}(y \in B, d(x, y))\Bigr),$

(4.4) $v_{r,p,q}^{(k,i,j)}(A, B) = M_r^{(k)}\Bigl(S \in \{A, B\},\ M_p^{(i)}\bigl(x \in A \cup B,\ 1_{S^c}(x)\, M_q^{(j)}(y \in S, d(x, y))\bigr)\Bigr),$

where $i, j, k \in \{0, 1\}$ and $S^c$ is the complement of $S$.

In addition, let $w$ be extended to $w \in [0, 1]$ to include zero, though a weight
for at least one summand must still be positive. Furthermore, it is assumed that
$0 \cdot \infty = 0$, in order to ensure $0 \cdot 0^p = 0$ for $p < 0$ in $M_p^{(1)}$, so that the zero weight can
be used for excluding terms from averaging even if $\psi(x) = 0$ (i.e., distance zero) in
the terms. Then, the function (4.3) can be simply expressed as

(4.5) $u_{p,q}^{(i,j)}(A, B) = M_p^{(i)}\bigl(x \in A \cup B,\ M_q^{(j)}(y \in A \cup B,\ d(x, y),\ w(x, y))\bigr)$

by using the weight function defined by

$w(x, y) = [x \in A \setminus B]\,[y \in B] + [x \in A \cap B]\,[x = y] + [x \in B \setminus A]\,[y \in A]$
$\phantom{w(x, y)} = 1_{A \setminus B}(x)\, 1_B(y) + 1_{A \cap B}(x)\,[x = y] + 1_{B \setminus A}(x)\, 1_A(y),$

where $[\cdot]$ denotes the Iverson bracket, that is, a quantity defined to be 1 whenever
the statement within the brackets is true, and 0 otherwise.

The distance function $f$ in (3.1) and the Hausdorff metric (2.3) are expressed by

$f(A, B) = u_{1,1}^{(1,1)}(A, B) = 2\, v_{1,1,1}^{(1,1,1)}(A, B),$

$h(A, B) = u_{\infty,-\infty}^{(i,j)}(A, B) = v_{\infty,\infty,-\infty}^{(k,i,j)}(A, B),$

respectively, where $i, j, k \in \{0, 1\}$.
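
The two special cases can be reproduced numerically from the composite form (4.3). The sketch below is a minimal equal-weight implementation written for illustration; the function names `power_mean` and `u`, the handling of the limiting parameters and the sample sets are assumptions made here, and the weighted form (4.5) is not covered.

```python
import math


def power_mean(values, p, kind=1):
    """Equal-weight extended power mean; kind=1 follows (4.1), kind=0 follows (4.2)."""
    values = list(values)
    n = len(values)
    if p == math.inf:                     # limit p -> +infinity: maximum
        return max(values)
    if p == -math.inf:                    # limit p -> -infinity: minimum
        return min(values)
    if kind == 1:
        if p <= 0 and any(v == 0 for v in values):
            return 0.0                    # zero value: convention for p < 0, limit for p = 0
        if p == 0:                        # limit p -> 0: geometric mean
            return math.exp(sum(math.log(v) for v in values) / n)
        return (sum(v ** p for v in values) / n) ** (1.0 / p)
    if p == 0:                            # limit p -> 0 of (4.2): arithmetic mean
        return sum(values) / n
    return math.log(sum(math.exp(p * v) for v in values) / n) / p


def u(A, B, d, p, q, i=1, j=1):
    """Composite distance of (4.3) with outer type i and inner type j."""
    A, B = set(A), set(B)
    inner = []
    for x in A | B:
        if x in B - A:
            inner.append(power_mean([d(x, y) for y in A], q, j))
        elif x in A - B:
            inner.append(power_mean([d(x, y) for y in B], q, j))
        else:                             # x in both sets: both indicators vanish
            inner.append(0.0)
    return power_mean(inner, p, i)


d = lambda x, y: abs(x - y)
A, B = {0, 1}, {1, 3}
print(u(A, B, d, 1, 1))                   # 1.5, the value of f(A, B) from (3.1)
print(u(A, B, d, math.inf, -math.inf))    # 2, the Hausdorff distance h(A, B)
```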

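Although it is unclear, at present, what conditions on the parameters $i, j, k, p,
q$, and $r$ are necessary for (4.3) and (4.4) to be metrics, these generalized forms are
in fact capable of generating various distance functions as follows:

Example 6. The exponential types $u_{p,q}^{(0,0)}(A, B)$ and $v_{r,p,q}^{(0,0,0)}(A, B)$ are written as

$$u_{p,q}^{(0,0)}(A, B) = \frac{1}{p} \ln \Biggl( \frac{1}{|A \cup B|} \Biggl( |A \cap B| + \sum_{x \in B \setminus A} \Bigl( \frac{1}{|A|} \sum_{y \in A} e^{q\, d(x, y)} \Bigr)^{p/q} + \sum_{x \in A \setminus B} \Bigl( \frac{1}{|B|} \sum_{y \in B} e^{q\, d(x, y)} \Bigr)^{p/q} \Biggr) \Biggr),$$

$$v_{r,p,q}^{(0,0,0)}(A, B) = \frac{1}{r} \ln \Biggl( \frac{1}{2} \Biggl( \frac{|A| + \sum_{x \in B \setminus A} \bigl( \frac{1}{|A|} \sum_{y \in A} e^{q\, d(x, y)} \bigr)^{p/q}}{|A \cup B|} \Biggr)^{r/p} + \frac{1}{2} \Biggl( \frac{|B| + \sum_{x \in A \setminus B} \bigl( \frac{1}{|B|} \sum_{y \in B} e^{q\, d(x, y)} \bigr)^{p/q}}{|A \cup B|} \Biggr)^{r/p} \Biggr).$$

If $d$ is the discrete metric multiplied by a positive constant $\lambda$, then we have

(4.6) $u_{p,q}^{(0,0)}(A, B) = \dfrac{1}{p} \ln \dfrac{e^{p\lambda}\, |A \triangle B| + |A \cap B|}{|A \cup B|},$

(4.7) $v_{0,p,q}^{(0,0,0)}(A, B) = \dfrac{1}{2p} \ln \Bigl( \Bigl( 1 - (1 - e^{p\lambda}) \dfrac{|A \setminus B|}{|A \cup B|} \Bigr) \Bigl( 1 - (1 - e^{p\lambda}) \dfrac{|B \setminus A|}{|A \cup B|} \Bigr) \Bigr).$

The functions (4.6) and (4.7) are metrics for $p > 0$ and $p < 0$, respectively. The
proofs of each triangle inequality are outlined in Appendix B. If $p = 0$, then it is
the same situation as Example 4. By taking the limit as $p \to 0$, we can see that
both functions are equal to the Jaccard distance (2.4) up to a constant coefficient proportional to $\lambda$.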

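4.2. Hierarchical metric spaces. Suppose that $(X, d)$ is a metric space and $S(X)$
is the collection of all non-empty finite subsets of $X$. Let $k$ be a non-negative
integer, let $S_{k+1}(X)$ denote the collection of all non-empty finite subsets of $S_k(X)$,
and let $f_k$ be a metric on $S_k(X)$, where $S_1(X)$, $S_0(X)$, $f_1$ and $f_0$ correspond to,
respectively, $S(X)$, $X$, $f$, and $d$ in Theorem 3. For $k \ge 1$, in much the same way,
for each $A$ and $B$ in $S_k(X)$, the function $f_k(A, B)$ can be defined by

(4.8) $f_k(A, B) = \dfrac{1}{|A \cup B|\,|A|} \sum_{a \in A} \sum_{b \in B \setminus A} f_{k-1}(a, b) + \dfrac{1}{|A \cup B|\,|B|} \sum_{a \in A \setminus B} \sum_{b \in B} f_{k-1}(a, b).$

The resulting hierarchy of metric spaces $(S_k(X), f_k)$ will be useful for constructing hierarchical hyperspaces.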
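
The recursive construction can be sketched as a function that applies the form of (3.1) at level $k$ with the level $k-1$ function as the element distance. The code below is an illustration under that reading; the name `f_k`, the use of nested `frozenset` objects and the sample data are assumptions made here.

```python
def f_k(k, A, B, d):
    """Level-k distance: f_0 = d, and f_k applies the form (3.1) with f_{k-1} on elements."""
    if k == 0:
        return d(A, B)
    A, B = frozenset(A), frozenset(B)
    n_union = len(A | B)
    only_b, only_a = B - A, A - B
    s_a = sum(f_k(k - 1, a, b, d) for a in A for b in only_b)
    s_b = sum(f_k(k - 1, a, b, d) for a in only_a for b in B)
    return s_a / (n_union * len(A)) + s_b / (n_union * len(B))


d = lambda x, y: abs(x - y)
fs = frozenset

# Two elements of S_2(X): sets whose members are finite sets of numbers.
doc1 = fs({fs({0, 1}), fs({5})})
doc2 = fs({fs({0, 1}), fs({6, 8})})

print(f_k(1, {0, 1}, {1, 3}, d))      # 1.5 up to rounding, same as f in (3.1)
print(f_k(2, doc1, doc2, d))          # 2.5 up to rounding
```

For singletons the levels are nested isometrically: `f_k(2, {fs({0})}, {fs({3})}, d)` returns the base distance 3.0.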

4.3. Duality. There is a kind of duality between sets and elements with respect
to their distance functions. For example, we can define the functions $D$ and $d$
symmetrically as follows:

(4.9) $D(A, B) = |\{a \mid a \in A\} \triangle \{b \mid b \in B\}| = |A \triangle B|,$

(4.10) $d(a, b) = |\{A \mid a \in A\} \triangle \{B \mid b \in B\}| = |C(a) \triangle C(b)|,$

where $C(a) = \{A \mid a \in A\}$. The set-to-set distance $D$ in (4.9) is a metric due to
the axiom of extensionality, and the element-to-element distance $d$ in (4.10) is a
pseudo-metric.

According to Theorem 3, $D$ can be defined by $f$ in (3.1), instead of (4.9), so that
we have

$D(A, B) = f(A, B),$

$d(a, b) = |C(a) \triangle C(b)|.$

In this case, $D$ is a pseudo-metric, depending on $d$, and there exists a condition that
satisfies $d(a, b) = f\bigl(\bigcap C(a), \bigcap C(b)\bigr)$. Furthermore, in the situation of Section 4.2, let
$(X, d)$ be a metric space, let $S_1(X)$ be the collection of all non-empty finite subsets
of $X$, and let $C(a) = \{A \in S_1(X) \mid a \in A\}$ be an element of $S_2(X)$. If $d$ is the
discrete metric, then we have

$D(A, B) = f_1(A, B),$

$f_2(C(a), C(b)) = \kappa\, d(a, b),$

where $\kappa$ is a certain positive real number such that $\kappa < 1$. If there exists $d$ such
that $d(a, b) = f_2(C(a), C(b))$, then the isometric copy of $(X, d)$ is contained in
$(S_2(X), f_2)$ and $d$ can be regarded as a function of $D$. In general, there possibly
exist $D$ and $d$ that are formally expressed by

(4.11) $D(A, B) = F(\{d(a, b) \mid a \in A,\ b \in B\}),$

(4.12) $d(a, b) = G(\{D(A, B) \mid a \in A,\ b \in B\}),$

where $F$ and $G$ may be generalized functions such as the one given in (4.5). It is interesting
to consider whether $F$ and $G$ really exist and what features they have. In numerical
analysis, $D$ and $d$ that are consistent with each other will be obtained by the iterative
computation of (4.11) for all $A, B \in S_1(X)$ and (4.12) for all $a, b \in X$, starting
with an initial metric space $(X, d)$, if each converges to a non-trivial function.

4.4. Generalized metrics. The group average distance $g(A, B)$ in (2.5) can be
regarded as a generalized metric that satisfies conditions M1 (non-negativity), M4
(symmetry) and M5 (triangle inequality) in Definition 1. In conventional topology,
there has been no such generalization by dropping both conditions M2 and M3,
which are usually combined together into the single axiom $d(a, b) = 0 \Leftrightarrow a = b$
(identity, reflexivity, or coincidence). Although M3 can be dropped for the pseudo-metric
so as to allow $d(a, b) = 0$ for $a \ne b$, the self-distance $d(a, a) = 0$ (M2) seems
to be indispensable in point-set topology where the element is a point having no
size. An exception is a partial metric [9] that is defined to satisfy M1, M3, M4 and,
instead of M5, the following partial metric triangularity,

(4.13) $d(a, b) + d(b, c) \ge d(a, c) + d(b, b).$

In computer science or information science, an element of a data set is not merely
a simple point. It may have rich contents inside. Some elements may have internal
structures which cause non-zero self-distance, and some elements may have properties
that differ from each other, even though they are indiscernible from a metric point of
view. The concept of distance can be used for measuring not only the difference
between objects but also the cost of moving or the energy of transition between
states. This is the reason why the generalization toward non-zero self-distance is
worth considering. If the triangle inequality holds, it provides an upper and lower
bound for them. The group average distance $g(A, B)$ is a typical example,
and it is simpler and more natural than the metric $f(A, B)$ in (3.1).

Incidentally, the function $g(A, B)$ is not a partial metric because it does not
satisfy (4.13). On the other hand, $f(A, B)$ gives an approach to an instance of
partial metrics from a special case of (4.7) in Example 6. By taking the limit as
$\lambda \to \infty$ for $p < 0$ and multiplying by a positive constant, for non-empty finite sets $A$
and $B$, we have the following metric,

$D_{\nu}(A, B) = \log |A \cup B| - \nu \log(|A|\,|B|),$

where $\nu = 1/2$. Its triangle inequality is equivalent to $|A \cup B|\,|B \cup C| \ge |A \cup C|\,|B|$.
This suggests that, for $\nu \in [0, 1/2)$, $D_{\nu}$ is a partial metric on a collection of non-empty
finite sets.
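
The metric conditions for $D_{\nu}$ with $\nu = 1/2$ can be checked exhaustively on a small ground set. The sketch below is an illustration only; the ground set of five labels, the floating-point tolerance and the helper names are assumptions made here.

```python
import math
from itertools import combinations, product


def D(A, B, nu=0.5):
    """D_nu(A, B) = log|A ∪ B| - nu * log(|A| |B|) for non-empty finite sets."""
    A, B = frozenset(A), frozenset(B)
    return math.log(len(A | B)) - nu * math.log(len(A) * len(B))


ground = range(5)
subsets = [frozenset(c) for r in range(1, 6) for c in combinations(ground, r)]
tol = 1e-12                                    # absorbs floating-point rounding

identity_ok = all((abs(D(A, B)) <= tol) == (A == B)
                  for A, B in product(subsets, repeat=2))
triangle_ok = all(D(A, B) + D(B, C) >= D(A, C) - tol
                  for A, B, C in product(subsets, repeat=3))
cardinality_ok = all(len(A | B) * len(B | C) >= len(A | C) * len(B)
                     for A, B, C in product(subsets, repeat=3))

print(identity_ok, triangle_ok, cardinality_ok)   # True True True
```

The last flag checks the equivalent cardinality inequality $|A \cup B|\,|B \cup C| \ge |A \cup C|\,|B|$ in exact integer arithmetic.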

4.5. Extension to infinite sets. If $S(X)$ is the collection of all non-null measurable
subsets of $(X, d)$, and $d$ is Lebesgue integrable on each element of $S(X)$, then
the group average distance $g(A, B)$, for $A, B \in S(X)$, can be defined by

(4.14) $g(A, B) = \dfrac{1}{\mu(A)\, \mu(B)} \displaystyle\int_A \Bigl( \int_B d(x, y)\, d\mu(y) \Bigr) d\mu(x),$

where $x \in A$, $y \in B$, and $\mu$ is a measure on $X$, and then the distance function (3.2)
can be extended to

(4.15) $f(A, B) = \dfrac{\mu(B \setminus A)}{\mu(A \cup B)}\, g(A, B \setminus A) + \dfrac{\mu(A \setminus B)}{\mu(A \cup B)}\, g(A \setminus B, B).$

If $d$ is the discrete metric, then (4.15) is equal to the Steinhaus distance [1].

Example 7. Let $(\mathbb{R}, d)$ be a metric space and let $d(x, y) = |x - y|$. For two intervals
$A$ and $B$, the distance function (4.15) can be expressed as

$$f(A, B) = \frac{|\sup(A) - \sup(B)| + |\inf(A) - \inf(B)|}{2} - \frac{|\sup(A) - \sup(B)|\; |\inf(A) - \inf(B)|}{\sup(A \cup B) - \inf(A \cup B)}\, [(A \subset B) \vee (A \supset B)].$$

If $A \not\subset B$ and $A \not\supset B$ (i.e., $[(A \subset B) \vee (A \supset B)] = 0$), then $f(A, B)$ is equal to
the distance between the centers of $A$ and $B$. This is consistent with an intuitive
notion of the distance between balls in this $(\mathbb{R}, d)$.
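
The closed form can be compared with the finite-set function (3.1) evaluated on fine grids that approximate the intervals. The sketch below is illustrative: intervals are passed as hypothetical `(lower, upper)` pairs, and the grid resolution `n` is an arbitrary choice.

```python
def interval_f(A, B):
    """Closed form of Example 7 for non-degenerate intervals A = (a_lo, a_hi), B = (b_lo, b_hi)."""
    (a_lo, a_hi), (b_lo, b_hi) = A, B
    d_sup = abs(a_hi - b_hi)
    d_inf = abs(a_lo - b_lo)
    nested = (b_lo <= a_lo and a_hi <= b_hi) or (a_lo <= b_lo and b_hi <= a_hi)
    span = max(a_hi, b_hi) - min(a_lo, b_lo)          # sup(A ∪ B) - inf(A ∪ B)
    correction = d_sup * d_inf / span if nested else 0.0
    return (d_sup + d_inf) / 2 - correction


def grid_f(A, B, n=100):
    """Finite-set f of (3.1) on integer grids with n points per unit length."""
    GA = set(range(round(A[0] * n), round(A[1] * n) + 1))
    GB = set(range(round(B[0] * n), round(B[1] * n) + 1))
    n_union = len(GA | GB)
    only_b, only_a = GB - GA, GA - GB
    s_a = sum(abs(a - b) for a in GA for b in only_b)
    s_b = sum(abs(a - b) for a in only_a for b in GB)
    return (s_a / (n_union * len(GA)) + s_b / (n_union * len(GB))) / n


print(interval_f((1, 2), (0, 4)), grid_f((1, 2), (0, 4)))   # nested: 1.0 and about 1.001
print(interval_f((0, 2), (1, 4)), grid_f((0, 2), (1, 4)))   # overlapping: both about 1.5
```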

If $S(X)$ is the collection of all non-empty, countably infinite subsets (measure-zero
sets) of $X$, then $g(A, B)$ and $f(A, B)$ should be defined by taking limits in
(2.5) and (3.2), provided that both have definite values. In order to determine the
average distance, we have to define a proper condition under which the sets can be said to be
"averageable". The average distance will strongly depend on accumulation points in
$A$ and $B$, and it will require additional assumptions on the difference in strength
between the accumulation points. This requirement is closely related to a "relative
measure" that is needed to obtain the ratio of the cardinality of an infinite set to
the cardinality of its superset in (3.2). In conventional measure theory, however,
any set of cardinality $\aleph_0$ has Lebesgue measure zero and infinite counting measure, so that both the counting
measure and the Lebesgue measure are useless for computing the ratio. It is necessary
to use another measure. A feasible solution is discussed in the following section.

4.6. Estimation by sampling. In application to computational data analysis,
statistical estimation by sampling is very useful for obtaining the approximate value
of $f(A, B)$ when the size of the sets is very large. According to the law of large
numbers, if enough sample elements are selected randomly, an average generated
by those samples should approximate the average of the total population. The
procedure is as follows:

(1) Choose a superset $P$ of $A \cup B$ as a population, i.e., $P \supseteq A \cup B$.

(2) Select a finite subset $S$ of $P$ as a sample obtained by random sampling.

(3) Let $S_A = S \cap A$ and $S_B = S \cap B$. Then, compute $f(S_A, S_B)$ as an
approximation of $f(A, B)$.

The sampling process and its randomness are crucial for efficiently estimating a good
approximation. Some useful hints could be found in various sampling techniques
developed for Monte Carlo methods [10]. In most cases, sampling error is expected
to decrease as the sample size increases, except for situations where the distribution
of $d$ has no mean (e.g., Cauchy distribution).
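
The three steps of the procedure can be sketched as follows. This is a minimal illustration: the integer population, the sample size, the seed and the helper name `estimate_f` are assumptions made here, and returning `None` when the sample misses one of the sets is a design choice of this sketch rather than part of the procedure.

```python
import random


def f(A, B, d):
    """Average-distance function (3.1) for finite non-empty sets A and B."""
    A, B = set(A), set(B)
    n_union = len(A | B)
    only_b, only_a = B - A, A - B
    s_a = sum(d(a, b) for a in A for b in only_b)
    s_b = sum(d(a, b) for a in only_a for b in B)
    return s_a / (n_union * len(A)) + s_b / (n_union * len(B))


def estimate_f(A, B, population, sample_size, d, rng):
    """Steps (2) and (3): draw a random sample S of P and evaluate f(S ∩ A, S ∩ B)."""
    S = set(rng.sample(sorted(population), sample_size))
    S_A, S_B = S & set(A), S & set(B)
    if not S_A or not S_B:
        return None                      # sample too small to meet both sets
    return f(S_A, S_B, d)


d = lambda x, y: abs(x - y)
A = set(range(0, 600))
B = set(range(400, 1000))
P = set(range(0, 1000))                  # step (1): population containing A ∪ B

rng = random.Random(0)                   # fixed seed for repeatability
exact = f(A, B, d)
approx = estimate_f(A, B, P, 200, d, rng)
print(exact)                             # 400.0
print(approx)                            # an estimate based on 200 sampled elements
```

The estimate costs on the order of the squared sample size in distance evaluations, compared with the product of the set sizes for the exact value.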

The notion of sampling suggests an intuitive way to define a relative measure
on a $\sigma$-algebra over a set $X$, which could be called a "sample counting measure".
Suppose $A$ and $B$ are subsets of $X$. Let $\rho(A : B)$ be the ratio of the cardinality of
$A$ to the cardinality of $B$, let $P$ be a superset of $A \cup B$, and let $S_n$ be a non-empty
finite subset of $P$ such that $S_n = \bigcup_{i=1}^{n} Y_i$, where $Y_i$ is the $i$-th non-empty sample
randomly selected from $P$. Then, the ratio $\rho(A : B)$ can be determined by taking
the limit as $n$ approaches $\infty$ as follows:

$\rho(A : B) = \lim_{n \to \infty} \dfrac{|S_n \cap A|}{|S_n \cap B|},$

if there exist such a limit and a random choice function that performs random
sampling. Otherwise, instead of random sampling, systematic sampling could be
available if the elements of $X$ are supposed to be distributed with uniform density
in its measurable metric space. For example, suppose there exists a finite partition
of $P$ where every part has an almost equal diameter. It seems better for $S_n$ to have
exactly one element from each of the parts.

4.7. Metrics for fuzzy sets and probability distributions. A fuzzy set can
be represented by a collection of crisp sets so that the distance between fuzzy sets
can be defined by the distance between the collections of such crisp sets. Let $A$ be
a fuzzy set: $A = \{(x, m_A(x)) \mid x \in X\}$, where $m_A(x)$ is a membership function,
and let $A_{\alpha}$ be a crisp set called an $\alpha$-level set [11] such that $A_{\alpha} = \{x \in X \mid
m_A(x) \ge \alpha\}$. Then, $A$ can be represented by the following set of ordered pairs:
$C(A) = \{(A_{\alpha}, \alpha) \mid \alpha \in (0, 1]\}$. The distance between two fuzzy sets $A$ and $B$ can
be defined by $f_2(C(A), C(B))$, where there may be various ways to treat $\alpha$. This
notion is also applicable to the distance between probability distributions, where
probability density functions are used instead of the membership function.

5. Concluding Remarks 

We have found that, for a metric space $(X, d)$, there exists a distance function
between non-empty finite subsets of $X$ that is a metric based on the average distance
of $d$. The distance function (3.1) in Theorem 3 is the most typical one, which
includes the Jaccard distance as a special case where $d$ is the discrete metric. Its
extensions based on the power mean will be useful to develop generalized forms that
also include the Hausdorff metric and various other distance functions. Furthermore,
the extensions to infinite subsets of $X$ will provide metrics for measuring
dissimilarity of fuzzy sets and probability distributions.

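Appendix A. Triangle Inequality in Theorem 3

The triangle inequality for

$f(A, B) = (|A \cup B|\,|A|)^{-1} s(A, B \setminus A) + (|A \cup B|\,|B|)^{-1} s(A \setminus B, B)$

can be proved by showing the following inequality:

$|A|\,|B|\,|C|\,|A \cup B|\,|B \cup C|\,|A \cup C|\,\bigl(f(A, B) + f(B, C) - f(A, C)\bigr) \ge 0.$

Let $A \cup B \cup C$ be decomposed into the following seven disjoint sets:

$\alpha = A \setminus (B \cup C)$, $\beta = B \setminus (A \cup C)$, $\gamma = C \setminus (A \cup B)$,

$\delta = (A \cap B) \setminus C$, $\varepsilon = (B \cap C) \setminus A$, $\zeta = (C \cap A) \setminus B$, $\eta = A \cap B \cap C$,

and let $\theta = B \setminus \beta = B \cap (A \cup C)$, so that we have

$A = \alpha \cup \delta \cup \zeta \cup \eta$, $|A| = |\alpha| + |\delta| + |\zeta| + |\eta|$,

$B = \beta \cup \delta \cup \varepsilon \cup \eta$, $|B| = |\beta| + |\delta| + |\varepsilon| + |\eta|$,

$C = \gamma \cup \varepsilon \cup \zeta \cup \eta$, $|C| = |\gamma| + |\varepsilon| + |\zeta| + |\eta|$,

$\theta = \delta \cup \varepsilon \cup \eta$, $|\theta| = |\delta| + |\varepsilon| + |\eta|$.

Taking account of (2.8) and (2.9), we have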

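$$\begin{aligned}
&|A|\,|B|\,|C|\,|A \cup B|\,|B \cup C|\,|A \cup C|\,\bigl(f(A, B) + f(B, C) - f(A, C)\bigr) \\
&= |B|\,|C|\,|B \cup C|\,|A \cup C|\, s(A, B \setminus A) + |A|\,|C|\,|B \cup C|\,|A \cup C|\, s(A \setminus B, B) \\
&\quad+ |A|\,|C|\,|A \cup B|\,|A \cup C|\, s(B, C \setminus B) + |A|\,|B|\,|A \cup B|\,|A \cup C|\, s(B \setminus C, C) \\
&\quad- |B|\,|C|\,|A \cup B|\,|B \cup C|\, s(A, C \setminus A) - |A|\,|B|\,|A \cup B|\,|B \cup C|\, s(A \setminus C, C) \\[4pt]
&= |B|\,|C|\,\Bigl(|\gamma|\,|\delta \cup C|\, s(A, B \setminus A) + \bigl(|\gamma|\,|\alpha| + |B \cup \zeta|\,|C \setminus A|\bigr)\, s(A, \beta) \\
&\qquad+ |B \cup \zeta|\,|A|\,\bigl(s(A \setminus C, \beta) + s(A \cap C, \beta)\bigr) + \bigl(|\beta|\,|\gamma| + |B \cup C|\,|A \cup \varepsilon|\bigr)\, s(A, \varepsilon)\Bigr) \\
&\quad+ |A|\,|C|\,\Bigl(\bigl(|\gamma|\,|A \cup C| + |\zeta|\,|\gamma| + |\zeta|\,|A \cup \varepsilon|\bigr)\, s(\alpha, B) \\
&\qquad+ |B|\,|A \cup \varepsilon|\,\bigl(s(\alpha, B \setminus C) + s(\alpha, B \cap C)\bigr) + |B|\,|\gamma|\,\bigl(s(\alpha, \beta) + s(\alpha, \theta)\bigr) \\
&\qquad+ \bigl(|\delta \cup C|\,|\gamma| + |\delta \cup C|\,|A \cup \varepsilon| + |\beta|\,|A \cup C|\bigr)\, s(\zeta, B)\Bigr) \\
&\quad+ |A|\,|C|\,\Bigl(\bigl(|\alpha|\,|A \cup C| + |\zeta|\,|\alpha| + |\zeta|\,|\delta \cup C|\bigr)\, s(B, \gamma) \\
&\qquad+ |B|\,|\delta \cup C|\,\bigl(s(B \setminus A, \gamma) + s(A \cap B, \gamma)\bigr) + |B|\,|\alpha|\,\bigl(s(\beta, \gamma) + s(\theta, \gamma)\bigr) \\
&\qquad+ \bigl(|A \cup \varepsilon|\,|\alpha| + |A \cup \varepsilon|\,|\delta \cup C| + |\beta|\,|A \cup C|\bigr)\, s(B, \zeta)\Bigr) \\
&\quad+ |A|\,|B|\,\Bigl(|\alpha|\,|A \cup \varepsilon|\, s(B \setminus C, C) + \bigl(|\alpha|\,|\gamma| + |\zeta \cup B|\,|A \setminus C|\bigr)\, s(\beta, C) \\
&\qquad+ |\zeta \cup B|\,|C|\,\bigl(s(\beta, A \cap C) + s(\beta, C \setminus A)\bigr) + \bigl(|\beta|\,|\alpha| + |A \cup B|\,|\delta \cup C|\bigr)\, s(\delta, C)\Bigr) \\
&\quad- |B|\,|C|\,\Bigl(|\zeta \cup B|\,|\beta|\, s(A, C \setminus A) + \bigl(|\alpha|\,|\beta| + |B \setminus A|\,|\delta \cup C|\bigr)\, s(A, \gamma) \\
&\qquad+ |A|\,|\delta \cup C|\,\bigl(s(\alpha \cup \zeta, \gamma) + s(A \cap B, \gamma)\bigr) + \bigl(|\beta|\,|\gamma| + |A \cup \varepsilon|\,|B \cup C|\bigr)\, s(A, \varepsilon)\Bigr) \\
&\quad- |A|\,|B|\,\Bigl(|\beta|\,|B \cup \zeta|\, s(A \setminus C, C) + \bigl(|\beta|\,|\gamma| + |A \cup \varepsilon|\,|B \setminus C|\bigr)\, s(\alpha, C) \\
&\qquad+ |A \cup \varepsilon|\,|C|\,\bigl(s(\alpha, \zeta \cup \gamma) + s(\alpha, B \cap C)\bigr) + \bigl(|\alpha|\,|\beta| + |A \cup B|\,|\delta \cup C|\bigr)\, s(\delta, C)\Bigr) \\[4pt]
&= |B|\,|C|\,\bigl(|\delta \cup C|\, t(A, B \setminus A, \gamma) + |\alpha|\, t(A, \beta, \gamma) + |B \cup \zeta|\, t(A, \beta, C \setminus A)\bigr) \\
&\quad+ |A|\,|B|\,\bigl(|A \cup \varepsilon|\, t(\alpha, B \setminus C, C) + |\gamma|\, t(\alpha, \beta, C) + |B \cup \zeta|\, t(A \setminus C, \beta, C)\bigr) \\
&\quad+ |A|\,|C|\,\bigl((|A \cup C| + |\zeta|)\, t(\alpha, B, \gamma) + |B|\, t(\alpha, \theta, \gamma)\bigr) \\
&\quad+ |A|\,|C|\,\bigl(|A \cup \varepsilon|\, t(\alpha, B, \zeta) + |C \cup \delta|\, t(\zeta, B, \gamma)\bigr) \\
&\quad+ 2\,|A|\,|C|\,\bigl(|\delta \cup C|\,|A \cup \varepsilon| + |\beta|\,|A \cup C|\bigr)\, s(B, \zeta) \\
&\quad+ 2\,|A|\,|B|\,|C|\,|B \cup \zeta|\, s(\beta, A \cap C) \ \ge\ 0,
\end{aligned}$$

where the equality holds if all terms of $s$ and $t$ are zero.

Appendix B. Triangle Inequalities of Example 6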
The triangle inequality for (4.6) can be proved as follows: Let $x = e^{p\lambda}$ and let

$$\tau(x) = \Bigl( \frac{x\,|A \triangle B| + |A \cap B|}{|A \cup B|} \Bigr) \Bigl( \frac{x\,|B \triangle C| + |B \cap C|}{|B \cup C|} \Bigr) - \Bigl( \frac{x\,|A \triangle C| + |A \cap C|}{|A \cup C|} \Bigr).$$

Then the triangle inequality for $p > 0$ is equivalent to $\tau(x) \ge 0$ for $x \ge 1$. The first
derivative of $\tau(x)$ with respect to $x$ is

$\tau'(x) = 2(x - 1)\, j(A, B)\, j(B, C) + j(A, B) + j(B, C) - j(A, C),$

where $j$ is the Jaccard distance (2.4). Since $\tau(1) = 0$ and $\tau'(x) \ge 0$ for $x \ge 1$, we
have $\tau(e^{p\lambda}) \ge 0$ for $p > 0$.

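The triangle inequality for (4.7) can be proved as follows: Let $y = 1 - e^{p\lambda}$ and
let

$$\tau(y) = \Bigl(1 - \frac{|A \setminus C|}{|A \cup C|}\, y\Bigr) \Bigl(1 - \frac{|C \setminus A|}{|A \cup C|}\, y\Bigr) - \Bigl(1 - \frac{|A \setminus B|}{|A \cup B|}\, y\Bigr) \Bigl(1 - \frac{|B \setminus A|}{|A \cup B|}\, y\Bigr) \Bigl(1 - \frac{|B \setminus C|}{|B \cup C|}\, y\Bigr) \Bigl(1 - \frac{|C \setminus B|}{|B \cup C|}\, y\Bigr) = y\, \phi(y),$$

where $\phi(y)$ is a cubic function of $y$ with a negative leading coefficient. Then the
triangle inequality for $p < 0$ is equivalent to $\tau(y) \ge 0$ for $y \in [0, 1)$. The function
$\phi(y)$ satisfies the following inequalities: $\phi(0) \ge 0$, $\phi(1) = \tau(1) \ge 0$, $\phi'(1) \le 0$, and
$\phi''(1) \ge 0$. These inequalities can be proved by decomposition of $A$, $B$, and $C$ into
$\alpha$, $\beta$, $\gamma$, $\delta$, $\varepsilon$, $\zeta$ and $\eta$ defined in Appendix A. Then, we have $\phi(y) \ge 0$ for $y \in [0, 1)$,
therefore, $\tau(1 - e^{p\lambda}) \ge 0$ for $p < 0$.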
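
Both triangle inequalities can also be checked numerically. The sketch below assumes the two distance functions take the forms implied by the definitions of $\tau$ in this appendix, namely $\frac{1}{p} \ln\bigl((e^{p\lambda} |A \triangle B| + |A \cap B|) / |A \cup B|\bigr)$ for $p > 0$ and $\frac{1}{2p} \ln$ of the product $\bigl(1 - y\,|A \setminus B| / |A \cup B|\bigr)\bigl(1 - y\,|B \setminus A| / |A \cup B|\bigr)$ with $y = 1 - e^{p\lambda}$ for $p < 0$. The names `u46` and `v47`, the parameter values and the ground set are assumptions for illustration.

```python
import math
from itertools import combinations, product


def u46(A, B, p, lam):
    """Assumed form for p > 0; its triangle inequality is tau(x) >= 0 with x = e^(p*lam)."""
    A, B = frozenset(A), frozenset(B)
    x = math.exp(p * lam)
    return math.log((x * len(A ^ B) + len(A & B)) / len(A | B)) / p


def v47(A, B, p, lam):
    """Assumed form for p < 0; its triangle inequality is tau(y) >= 0 with y = 1 - e^(p*lam)."""
    A, B = frozenset(A), frozenset(B)
    y = 1.0 - math.exp(p * lam)
    n = len(A | B)
    return math.log((1 - y * len(A - B) / n) * (1 - y * len(B - A) / n)) / (2 * p)


def triangle_holds(dist, subsets, tol=1e-12):
    """Exhaustive triangle-inequality check with a small rounding tolerance."""
    table = {(A, B): dist(A, B) for A, B in product(subsets, repeat=2)}
    return all(table[A, B] + table[B, C] >= table[A, C] - tol
               for A, B, C in product(subsets, repeat=3))


subsets = [frozenset(c) for r in range(1, 6) for c in combinations(range(5), r)]

print(triangle_holds(lambda A, B: u46(A, B, 1.5, 2.0), subsets))    # True
print(triangle_holds(lambda A, B: v47(A, B, -1.5, 2.0), subsets))   # True
```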